Abstract

A major challenge in crowdsourcing evaluation tasks like labeling objects, grading assignments
in online courses, etc., is that of eliciting truthful responses from agents in the absence
of verifiability. In this paper, we propose new reward mechanisms for such settings that, unlike
many previously studied mechanisms, impose minimal assumptions on the structure and knowledge
of the underlying generating model, can account for heterogeneity in the agents’ abilities,
require no extraneous elicitation from them, and furthermore allow their beliefs to be (almost)
arbitrary. These mechanisms have the simple and intuitive structure of an output agreement 
mechanism: an agent gets a reward if her evaluation matches that of her peer, but unlike the 
classic output agreement mechanism, this reward is not the same across evaluations, but is 
inversely proportional to an appropriately defined popularity index of each evaluation. The 
popularity indices are computed by leveraging the existence of a large number of similar tasks, 
which is a typical characteristic of these settings. Experiments performed on MTurk workers 
demonstrate higher efficacy (with a p-value of 0.02) of these mechanisms in inducing truthful 
behavior compared to the state of the art. 


1 Introduction 


Systems that leverage the wisdom of the crowd are ubiquitous today. Recommendation systems 
such as Yelp and others, where people provide ratings and reviews for various entities, are used by 
millions of people across the globe [Luc11]. Commercial crowd-sourcing platforms such as Amazon
Mechanical Turk, where workers perform microtasks in exchange for payments over the Internet,
are employed for a variety of purposes such as collecting labelled data to train machine learning
algorithms [RYZ+10]. In massive open online courses (MOOCs), students’ exams or assignments
are often evaluated by means of “peer-grading”, where students grade each other’s work [PHC+13].


A common feature in many of these applications is that they involve a large number of similar 
evaluation tasks and every agent performs a subset of these tasks. For instance, a typical collection 
of tasks on Amazon Mechanical Turk consists of labeling a large set of images for some machine
learning application. A standard peer-grading task in massive open online courses (MOOCs)
involves grading of a large number of submissions for each assignment. We call these tasks massively
crowdsourced evaluation tasks or MCETs.


A major challenge in MCETs is incentivizing the agents to report their evaluations truthfully - 
to not try to game the system for monetary gain. This is achieved by designing appropriate reward 
mechanisms, and there has been a considerable amount of prior work on designing such mechanisms
for different settings. Unfortunately, most of these mechanisms have found limited success
in practice. One critical drawback that we believe impedes their widespread use is the
fact that these mechanisms have a complex structure and description, which makes it difficult for
a typical agent to understand the mechanism and account for it while choosing their behavior.
Recently, in an effort to promote practical deployments of market designs, there has been a significant
push towards designing simpler economic mechanisms in the mechanism design research community
[Rub16, Li15]. Indeed, simplicity in mechanism design has been a theme in many recent workshops
in the research community; for instance, the abstract of one such workshop [Sim15] on “Complexity
and Simplicity in Economics” states “Ideal economic systems must still remain simple enough for
human participants to understand...” Due to the significant presence of the human element, these
considerations are all the more important in the case of crowd-sourcing. In this paper, we attempt to
address these issues in the context of MCETs by designing a class of simple reward mechanisms 
that incentivize truthful reporting. 

The research on designing reward mechanisms for crowdsourced evaluation tasks falls largely 
into two categories depending on whether or not one assumes the existence of so-called gold standard
objects [LEHB10, CMBN11]. These are a small subset of objects, for which the principal
either knows the correct evaluations a priori or can verify them accurately. Incentive design is then
facilitated by scoring the agents on their performance on these objects by using proper scoring 
rules |IM ISZP151 ISZTK] . 

The present paper contributes to a second line of research, that makes no assumption about 
the existence of such gold standard objects, and is more realistic in many applications of interest 
where obtaining correct evaluations for a fraction of objects is either impossible or too costly. There 
have been several works that operate specifically in this domain and have designed clever incentive 
mechanisms while making different assumptions on the behavior of the agents, and on the knowledge 
of the mechanism designer about this behavior [MRZ05, Pre04, WP12, RF13].

Our mechanisms build upon the structure of output agreement mechanisms [VAD08, VAD04]
that are simple, intuitive, and have been quite popular in practice, except they suffer from a critical 
drawback of not incentivizing truthful responses in general. In an output agreement mechanism, 
two agents answer the same question, and they are both rewarded if their answers match. From 
the perspective of an agent, in the absence of any extraneous information, this almost incentivizes 
truthful reporting, since in many cases it is more likely that the other agent also has the same 
answer. But this is not the case when the agent believes that her answer is relatively unpopular 
and that a typical agent will have a different opinion. It is then tempting to report the answer that 
is more likely to be popular rather than correct. Moreover, there is an undesirable equilibrium in 
this game where every person reports the same answer irrespective of their true evaluation, which 
guarantees each person the highest possible payoff rewarded by the mechanism. Our mechanisms 
overcome these drawbacks by giving proportionately higher rewards for answers that turn out to 
be relatively less popular and lower rewards for answers that turn out to be more popular on
average. These rewards are designed in such a way that as soon as the agent sees the object and
forms an evaluation, the conditional probabilities of the evaluations of another agent evaluating the
same object change relative to the overall popularity of the different evaluations in such a way that
it becomes more profitable for her to report her opinion truthfully. This is achieved by leveraging some
fundamental properties of the generating model. 
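
To make the failure of plain output agreement concrete, the following sketch works through a small numerical example. The two-type, two-answer model and all of its numbers (`prior`, `filt`) are illustrative assumptions that do not come from the paper; the code only computes, for an agent who has formed the less popular evaluation, how likely a truthful peer is to report each answer.

```python
# Illustrative sketch: why classic output agreement can reward misreporting.
# All numbers below are assumed for illustration only.

prior = [0.8, 0.2]            # assumed probabilities of two hidden object types
filt = [[0.9, 0.1],           # assumed P(answer | type 0) for answers 0 and 1
        [0.3, 0.7]]           # assumed P(answer | type 1)


def joint(a, b):
    """P(agent forms answer a and an independent peer forms answer b)."""
    return sum(prior[h] * filt[h][a] * filt[h][b] for h in range(len(prior)))


def peer_posterior(own):
    """Distribution of a truthful peer's answer given the agent's own answer."""
    marginal = sum(joint(own, b) for b in range(2))
    return [joint(own, b) / marginal for b in range(2)]


def best_report_classic_oa(own):
    """Report that maximizes the match probability (same reward for any match)."""
    post = peer_posterior(own)
    return max(range(2), key=lambda b: post[b])


if __name__ == "__main__":
    # An agent who formed the unpopular answer 1 still expects the peer
    # to say 0 more often than 1, so reporting 0 pays more than the truth.
    print(peer_posterior(1))
    print(best_report_classic_oa(1))
```

In this assumed model, answer 0 is the best report under a flat matching reward regardless of what the agent actually observed, which is exactly the drawback that popularity-scaled rewards are meant to remove.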

We consider a standard setting that assumes the existence of an underlying generating model 
that captures the inherent characteristics of each of these evaluation tasks, and the abilities/biases 
of the agents. The mechanisms we propose are minimal in the sense that they do not solicit any
extraneous information from the agents apart from their own individual evaluations. Further, they 
make minimal structural assumptions on the generating model and do not require the knowledge of 
its details. Finally, motivated by practical concerns, truthfulness is incentivized in quite a strong 
sense, in that, the agents are allowed to have (almost) arbitrary opinions or beliefs about the details 
of the generating model, e.g., one agent may grossly underestimate the abilities of the other agents 
and overestimate her own ability, while another agent may have no such opinions. In order to 
achieve these objectives, our mechanism assumes the existence of a large number of similar tasks, 
which is typical of MCETs. 

An important distinction that naturally arises in the MCET setting is that between a homogeneous
and a heterogeneous population of agents [RF15]. Homogeneity of the agents intuitively
means that all agents are statistically similar in the way they answer any question; it implies, for 
instance, that the agents do not have any relative biases or difference in abilities. As we argue later, 
such an assumption is reasonable in the case of surveys, where an agent’s answer to a question 
can be seen as an independent sample of the distribution of the answers in the population. But 
it is inappropriate in subjective evaluation tasks like rating movies or grading answers, in which 
systematic biases may exist because of differences in preferences, effort or abilities. In our design,
we propose mechanisms specifically tailored to both of these settings. In the heterogeneous case it is
known [RF15] that it is necessary to impose certain structural restrictions on the generating model
in order to be able to design truthful mechanisms. With this in mind, we restrict ourselves to
the setting of binary-choice evaluation tasks, and then propose a mechanism that is truthful under
a mild regularity assumption that is naturally justifiable in several MCETs of interest. This
assumption is substantially weaker than other assumptions that have appeared before in the literature
for similar settings (e.g., [DG13]).

Finally, we conduct experimental evaluation on Amazon Mechanical Turk to test how understandable
or “simple” these mechanisms with an output agreement structure and popularity-scaled
rewards are, and how successful they are at inducing optimal behavior. We compare one of our
mechanisms with a mechanism proposed in [RF15], which is the current state of the art for the homogeneous
population MCET setting, as a benchmark. The experiments reveal that our mechanism
is more successful in inducing truthful behavior (with a p-value equal to 0.02). 

The remainder of the paper is organized as follows. Section 2 presents a formal description of
the model considered in the paper. Given the model, Section 3 puts our work in perspective of the
existing literature. Section 4 and the section that follows it contain the main results of the paper. Section 4 presents
a mechanism to incentivize truthful reports without asking for additional information, assuming
that the population is homogeneous. The following section then extends the results to a setting that does not
make the homogeneity assumption. Our experimental results are presented after that, and the paper
concludes with a discussion.


2 Model 

Consider a population denoted by the set $\mathcal{M}$, with $M$ agents labelled $j = 1, \dots, M$. Consider an
evaluation task in which an agent $j$ in $\mathcal{M}$ interacts with an object and forms an evaluation taking values
in a finite set $\mathcal{S} = \{s_1, \dots, s_K\}$. Examples of object and evaluation pairs are: Movies/businesses
→ ratings (e.g., Yelp, IMDB), images → labels (in crowdsourced labeling tasks), assignments →
grades (in peer-grading). An agent’s evaluation for an object is influenced by the unknown attributes
of the object and the manner in which these attributes affect her evaluations, or in abstract terms,
her tastes, abilities etc. Note that the attributes of an object capture everything about the object
that could affect its evaluation and as such these attributes may or may not be measurable. For
example, in the case where a mathematical solution is being evaluated in a peer-grading platform,
its attributes could be elegance, handwriting, clarity of presentation etc., taking values: “elegance:
high, handwriting: poor, clarity of presentation: poor.” Denote the hidden attribute values of an
object by the quantity $X$, which we will simply call the type of the object and assume that this
type takes values in a finite universe $\mathcal{H} = \{h_1, \dots, h_L\}$.

Denote agent $j$’s evaluation for the object by $Y_j \in \mathcal{S}$. The manner in which an object’s attributes
influence her evaluation is modeled by a conditional probability distribution over $Y_j$ given different
values of $X$, i.e., $P(Y_j = s \mid X = h)$ for each $s \in \mathcal{S}$ and $h \in \mathcal{H}$. For notational convenience, we will
denote this distribution by $\{p_j(s|h)\}$ and we will refer to it as the “filter” of person $j$. Note that for
each $j$, the filter $\{p_j(s|h)\}$ can be represented as a stochastic matrix of size $|\mathcal{H}| \times |\mathcal{S}|$ (recall that a
stochastic matrix is one in which all the entries are non-negative and all the rows sum to 1). We
assume that the filters $\{p_j(s|h)\}$ themselves are drawn independently for each agent $j$, but from an
identical distribution $Q$ defined on a support $\mathcal{B}$ for all $j$, where $\mathcal{B}$ is some subset of the set of all
stochastic matrices of size $|\mathcal{H}| \times |\mathcal{S}|$.

In our setting there are $N$ similar objects, labeled $i = 1, \dots, N$, that are being evaluated. The
type of object $i$ is denoted by $X^i$ and each $X^i$ is assumed to be drawn independently from a common
probability distribution $P_X$ over $\mathcal{H}$. Let $\mathcal{M}^i \subseteq \mathcal{M}$ denote the set of persons that evaluate object $i$
and let $W_j$ be the set of objects that a person $j$ evaluates. If an agent $j$ evaluates object $i$, let $Y_j^i$
denote her evaluation for that object. We assume that since the objects are similar, the filters of
any individual agent for evaluating the different objects are the same, i.e.,

$$P(Y_j^i = s \mid X^i = h) = P(Y_j^{i'} = s \mid X^{i'} = h) = p_j(s|h).$$

Conditional on realizations of the filters $\{p_j(s|h)\}$ for all $j$, we make the following independence
assumptions:

1. The evaluations $Y_j^i$ by different $j$ are conditionally independent given $X^i$.

2. The sets of random variables $\{X^i, \{Y_j^i : j \in \mathcal{M}^i\}\}$ for the different objects $i$ are mutually
independent across objects.

In particular the second assumption implies that $Y_j^i$ and $Y_j^{i'}$ are independent for any person $j$ that
has evaluated objects $i$ and $i'$.^1 Note that the random variables $\{Y_j^i : j \in \mathcal{M}^i\}$ for a single object $i$
need not be independent unless conditioned on $X^i$.

The pair consisting of the probability distribution $P_X$ over types and the distribution $Q$ over filters is then
said to comprise a generating model, denoted as $(P_X, Q)$. In particular, given the conditional independence
assumptions above, they fully specify a joint distribution on the underlying types of the
different objects, the filters of the different agents, and the evaluations of the different agents of
these objects.
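
The sampling process described by the generating model can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `sample_evaluations`, its argument layout, and the use of already-realized filters as input (rather than drawing them from $Q$) are choices made here for illustration and are not part of the paper.

```python
import random


def sample_evaluations(prior, filters, assignments, seed=0):
    """Draw types X^i and evaluations Y_j^i, conditional on realized filters.

    prior       -- list with prior[h] = P_X(h)
    filters     -- dict agent -> stochastic matrix, filters[j][h][s] = p_j(s|h)
    assignments -- dict object -> list of agents evaluating it (the sets M^i)
    """
    rng = random.Random(seed)
    types, evals = {}, {}
    for i, agents in assignments.items():
        # Types are drawn independently across objects from the common P_X.
        types[i] = rng.choices(range(len(prior)), weights=prior)[0]
        for j in agents:
            # Given X^i, each agent's evaluation is an independent draw
            # from the row p_j(. | X^i) of her own filter.
            row = filters[j][types[i]]
            evals[(i, j)] = rng.choices(range(len(row)), weights=row)[0]
    return types, evals


if __name__ == "__main__":
    # Hypothetical two-type, two-evaluation example with two noisy agents.
    noisy = [[0.8, 0.2], [0.3, 0.7]]
    types, evals = sample_evaluations(
        prior=[0.6, 0.4],
        filters={"a": noisy, "b": noisy},
        assignments={0: ["a", "b"], 1: ["a", "b"]},
    )
    print(types, evals)
```

Because every evaluation depends on the other variables only through the type of its own object, the two independence assumptions above hold by construction in this sketch.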

Our goal is to design a payment mechanism that truthfully elicits evaluations from the population.
With MCETs in mind, we are specifically interested in the case where $N$ is large. The mechanism
designer is not assumed to have any knowledge of $P_X$ or the filters of the different people in the
population, or of $Q$. Further, we assume that every member of the population knows the structure
of the underlying generating model, in particular the existence of a single $P_X$ that generates the
type for each object, the existence of some $Q$ that generates the filters of every member, the
conditional independence assumptions on the evaluations given the type for every object, and the
independence of the evaluations across different objects. But the agents may not know, or may have
different subjective beliefs about the values of $P_X$, about $Q$, and even their own filter. We present
an example of this setting.

^1 This assumption precludes the possibility of dependence induced by lack of knowledge of some hidden information
about an agent, e.g., if a worker’s mood is bad on a particular day, there may be a bias in all her evaluations.

Example 2.1. Peer-grading in MOOCs: Peer-grading, where students evaluate their peers and
these evaluations are processed to assign grades to every student, has been proposed as a scalable
solution to the problem of grading in MOOCs. An important component of any such scheme is the
design of incentives so that students are truthful when they grade others. For example, say that the
answer of any student to a fixed question has some true grade A, B or C, which can be taken to be
the type of the answer. Suppose that a priori there is a distribution over the grade of any answer that
is common to all answers (to a fixed question). Each answer is then graded by a few students (and
in turn each student grades a few answers), who, depending on some given rubric and their abilities,
form an opinion as to what grade should be assigned to the answer. Similarly there are thousands of
such answers that are graded by other students. It is natural to assume that conditional on the true
grade of an answer, the evaluations of different students who grade that answer are independent. Also
it is natural to assume that the grades given by the students to different answers are independent.
One then wants to design a mechanism that incentivizes the students to report their true opinions
about the answers that they have graded.

Let $q_j^i$ denote person $j$’s reported evaluation for object $i$. Then we have the following
definition of a payment mechanism.

Definition 2.1. A payment (or scoring) mechanism is a set of functions $\{\tau_j : j \in \mathcal{M}\}$, one for each
person in the population, that map the reports $\{q_j^i : i = 1, \dots, N, j \in \mathcal{M}^i\}$ to a real valued payment
(or score).

We will work with the following notion of detail-free incentive compatibility. 

Definition 2.2. Consider a class $\mathcal{C}$ of generating models. We say that a given payment mechanism
$\{\tau_j : j \in \mathcal{M}\}$ is strictly detail-free Bayes-Nash incentive compatible with respect to the class $\mathcal{C}$ if for
each $j \in \mathcal{M}$,

$$E\big[\tau_j(\{y_j^i : i \in W_j\}, Y_{-j}) \,\big|\, \{Y_j^i = y_j^i : i \in W_j\}\big] > E\big[\tau_j(\{q_j^i : i \in W_j\}, Y_{-j}) \,\big|\, \{Y_j^i = y_j^i : i \in W_j\}\big] \qquad (1)$$

for each $\{y_j^i : i \in W_j\} \neq \{q_j^i : i \in W_j\}$, where the conditional expectation is with respect to the joint
distribution on the evaluations of the population resulting from any specification of the generating
model $(P_X, Q)$ in the class $\mathcal{C}$, and any $\{p_j(s|h)\}$ in the support of $Q$. Here $Y_{-j} = \{Y_{j'}^i : i = 1, \dots, N, j' \in \mathcal{M}^i, j' \neq j\}$.

This definition implies that as long as an agent believes that the generating model is in $\mathcal{C}$,
irrespective of whether or not she knows the generating model and her own filter, if everyone else
is truthful, she gets a strictly higher payoff by being truthful. Thus truthful reporting is a strict
equilibrium in the game induced by the mechanism if every agent believes that the generating model
is in $\mathcal{C}$.

We will consider two classes of generating models inspired by two types of applications that are
encountered in practice. The distinction between them lies in whether different agents evaluate
an object in a statistically similar manner.

• Homogeneous population: Consider a typical survey, e.g., suppose the government wishes
to find out the chance that a visit to the DMV office in a particular location at a particular
time of the day faces a waiting time of more than 1 hour. This is a number $X$ that can be
thought of as an attribute of the DMV and for simplicity, assume that it takes values in a
finite set, say $\{0, 0.1, 0.2, \dots, 1\}$. The evaluation $Y_j$ of any agent $j$ is just a value in $\{0, 1\}$, with 1
denoting that she faced a wait time of more than 1 hour. In this case it is natural to assume
that $P(Y_j = 1 \mid X = h) = h$, i.e., each person’s evaluation is an independent sample of the
hidden value $X$. This means that $p_j(s|h)$ does not depend on $j$, and is the same value $p(s|h)$
for everyone. In such a case, we say that the population is homogeneous, i.e., conditioned
on the type of the object, different agents form their evaluations in a statistically identical
fashion. In this case, $Q$ has its support on a single filter: the population filter $\{p(s|h)\}$. We
will consider this case in Section 4.

• Heterogeneous population: In most subjective evaluations, the manner in which agents
form evaluations differs considerably due to differences in preferences, abilities etc. So it is
natural to assume that the filters vary across the population, i.e., $Q$ has a support of size larger
than 1. We will consider this case in a later section. In general, it is impossible to design
detail-free truthful mechanisms for this case unless some additional structural assumptions are
made on the support of $Q$. We will propose a natural structural assumption for the case $|\mathcal{S}| = 2$,
i.e., in the case where the evaluations are binary, and design a truthful mechanism under this
assumption.
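
As a small illustration of the homogeneous case, the DMV survey example can be written down as a single population filter. The uniform prior over types used below is an assumption made only for this sketch; the text does not specify $P_X$.

```python
# Sketch of the homogeneous DMV survey example.
# Assumption (not from the text): the prior over types is uniform.

types = [k / 10 for k in range(11)]          # h in {0, 0.1, ..., 1}
prior = [1 / len(types)] * len(types)        # assumed uniform P_X

# Common population filter: the row for type h is [p(0|h), p(1|h)] = [1 - h, h].
population_filter = [[1 - h, h] for h in types]


def marginal(s):
    """Prior probability that an agent forms evaluation s."""
    return sum(prior[i] * population_filter[i][s] for i in range(len(types)))


if __name__ == "__main__":
    print(marginal(0), marginal(1))
```

Every agent shares the same matrix here, which is what it means for $Q$ to be supported on a single filter.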


3 Related work 

The theory of elicitation of private evaluations or predictions of events has a rich history. In 
the standard setting, an agent possesses some private information in the form of an evaluation of 
some object or some informed prediction about an event, and one would like to elicit this private 
information. There are two categories of these problems. In the first category, the ground truth, e.g., 
true quality or nature of the object or the knowledge of the realization of the event that one wants 
to predict, is available or will be available at a later stage. In this case, the standard technique is 
to score an agent’s reports against the ground truth, and proper scoring rules [GR07, Sav71, LS09]
provide an elegant framework to do so. In the second category of problems, the ground truth is 
not known. In this case there is little to be done except to score these reports against the reports 
of other agents who have provided similar predictions about the same event. The situation is 
then inherently strategic, in which one hopes to sustain truthful reporting as an equilibrium of a 
game: assuming all the other agents provide their predictions truthfully, these predictions form an 
informative ensemble, and with a carefully designed rule that scores reports against this ensemble, 
one incentivizes any agent to also be truthful. The present work falls in this category. 

In this category, the majority of the early literature has focused on the case where a single object is
being evaluated. In a pioneering work, the peer-prediction method of [MRZ05] assumed that
the population is homogeneous and the mechanism designer knows the agents’ beliefs about the
underlying generating model of evaluations. In this case they demonstrated the use of proper
scoring rules to design a truthful mechanism that utilizes the knowledge of these subjective beliefs.
These mechanisms are minimal in the sense that they only require agents to report their evaluations.
In another influential work, [Pre04] considered a homogeneous population and designed an oblivious
mechanism, famously termed Bayesian truth serum (BTS), that does not require the knowledge of
the underlying generating model, but requires that the number of agents is large and that they have
a common prior, i.e., they have the same beliefs about the underlying generating model and this
fact is common knowledge. This mechanism is not minimal: apart from reporting their evaluations,
agents are also required to report their beliefs about the reports of others. [WP12] and [RF13] later
used proper scoring rules to design similar mechanisms for the case where the population size is
finite. These mechanisms are again not minimal, and in fact it is known (see [JF11], [RF13]) that
no minimal mechanism that does not use the knowledge of the prior beliefs can incentivize truthful
reporting of evaluations.

In many crowd-sourcing applications, one is interested in acquiring evaluations
from a population for several similar objects. It is thus natural to explore the possibility of
exploiting this statistical similarity to design better (e.g., minimal) mechanisms for jointly scoring
these evaluation tasks. This is the context of the present work. Three major works
that have considered this case are [WP13], [DG13] and, more recently, [RF15]. Both [WP13] and [DG13]
only considered the case where the evaluations are binary. The former considered a homogeneous
population while the latter considered a heterogeneous population, with both making specific
assumptions on the generating model. [RF15], on the other hand, have considered both homogeneous
and heterogeneous populations.

For a homogeneous population with multiple objects, [WP13] try to utilize the statistical independence
of the objects to estimate the prior distribution of evaluations and use that to compute
payments using a proper scoring rule. In spirit, we are similar to this approach (and also [Pre04]) in
the sense that we use the law of large numbers to estimate some prior statistics and we get incentive
compatibility for a large population, but we do not restrict ourselves to the binary setting. [RF15]
recently have also designed a mechanism that is truthful in the general non-binary setting while
requiring only a finite number of objects, again using proper scoring rules. In their mechanism, for
computing the reward to an agent for evaluating a given object, a fixed-size sample of evaluations of other
agents for other objects needs to be collected, and an agent’s reward can be non-zero
only if this sample is sufficiently rich, i.e., it has an adequate representation of all the possible evaluations.
Although our mechanism needs the number of objects to be large, it has a much simpler
structure.

For the case of a heterogeneous population, [RF15] show that typically one cannot guarantee truthfulness
with minimal elicitation. Nevertheless, [DG13] have designed a truthful minimal mechanism
for the case of binary evaluations for a specific generating model: it is assumed that $\mathcal{H} = \mathcal{S}$ and for
each agent, the probability of correctly guessing the true type of the object is at least 0.5 and
does not depend on the type. Although we also consider binary evaluations, we allow $\mathcal{H}$ to be
arbitrary and our regularity condition is considerably weaker.

In a parallel development, [SAFP16] elegantly extended the mechanism in [DG13] for the heterogeneous
population setting to handle more than two evaluations. But their mechanism is truthful
only if the joint distribution of the evaluations seen by two agents evaluating the same object satisfies
a property that they refer to as being "categorical". It is a somewhat restrictive condition which
says that if an agent makes an evaluation $s_1$, then the conditional probability that the other agent
makes a certain other evaluation $s_2$ reduces relative to the prior probability of making that evaluation,
for every other evaluation $s_2$. If this condition is satisfied in our setting, then an "additive"
mechanism that we suggest for the case of a heterogeneous population is trivially truthful for non-binary
settings, almost by assumption. They also design a mechanism that is truthful in general,
in particular without this restriction, but they require that the mechanism designer has access to
certain information about the joint distribution of the evaluations for an object by two agents. In
another parallel development [RFJ16], the authors consider the heterogeneous population setting
and propose almost exactly the same mechanism as ours: it has an output agreement structure
where the rewards for matching on an evaluation are inversely proportional to an estimate of the
prior probability of seeing that evaluation. They show that the mechanism is truthful for the general
non-binary setting, but the condition under which this holds is the same as the condition for
truthfulness, and it is not clear if it would reasonably hold in practical settings.

Contrary to the assumptions in these two works, our regularity condition is a precise condition on
the generating model that can be mapped to a condition on the "behavior" of the agents. It implies
both the condition in [RFJ16] and the condition in [SAFP16] in the binary setting, and further it
can be argued to naturally hold in most MCETs of interest.


4 Homogeneous population 

In this section, we will first consider the case where the population of agents is homogeneous. We
will consider the following class of generating models $(P_X, Q)$, that we will call $\mathcal{C}_{hom}$.

Definition 4.1. $\mathcal{C}_{hom}$ is the class of all generating models $(P_X, Q)$ that satisfy the following set of
assumptions.

1. The population is homogeneous, i.e., $\{p_j(s|h)\} = \{p_{j'}(s|h)\}$ for any $j, j' \in \mathcal{M}$. This means
that $Q$ has support of size 1. $\{p(s|h)\}$ denotes the common population filter.

2. Define

$$\delta(P_X, Q) = \min_{k \neq l} \sqrt{\sum_h P_X(h)\, p(s_k|h)^2}\, \sqrt{\sum_h P_X(h)\, p(s_l|h)^2} - \sum_h P_X(h)\, p(s_k|h)\, p(s_l|h).$$

By the Cauchy-Schwarz inequality, $\delta(P_X, Q) \geq 0$. Then there is some $\delta_0 > 0$ such that

$$\inf_{(P_X, Q) \in \mathcal{C}_{hom}} \delta(P_X, Q) \geq \delta_0. \qquad (2)$$
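
The quantity in the second assumption is easy to evaluate numerically for a given homogeneous model. The sketch below is only an illustration: the function name `delta`, the example numbers, and the reading of the minimum as ranging over pairs of distinct evaluations are assumptions made here.

```python
import math
from itertools import combinations


def delta(prior, filt):
    """Smallest Cauchy-Schwarz gap over pairs of distinct evaluations.

    prior -- list with prior[h] = P_X(h)
    filt  -- population filter, filt[h][s] = p(s|h)
    """
    n_types, n_evals = len(prior), len(filt[0])
    gaps = []
    for k, l in combinations(range(n_evals), 2):
        sq_k = sum(prior[h] * filt[h][k] ** 2 for h in range(n_types))
        sq_l = sum(prior[h] * filt[h][l] ** 2 for h in range(n_types))
        cross = sum(prior[h] * filt[h][k] * filt[h][l] for h in range(n_types))
        gaps.append(math.sqrt(sq_k) * math.sqrt(sq_l) - cross)
    return min(gaps)


# Assumed example model: two types, two evaluations.
example_prior = [0.8, 0.2]
example_filter = [[0.9, 0.1],
                  [0.3, 0.7]]

if __name__ == "__main__":
    print(delta(example_prior, example_filter))
```

A filter whose rows are identical across types carries no information about the type, and for such a filter the gap collapses to zero, so the model falls outside any class with $\delta_0 > 0$.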


Let us take a closer look at the second assumption. The Cauchy-Schwarz inequality has the
following geometric interpretation. For any evaluation $s \in \mathcal{S}$, define the vector

$$v(s) = \big[\sqrt{P_X(h_1)}\, p(s|h_1), \dots, \sqrt{P_X(h_L)}\, p(s|h_L)\big], \qquad (3)$$

in the Euclidean space $\mathbb{R}^L$. Then the Cauchy-Schwarz inequality says that for any two evaluations
$s_k$ and $s_l$, the magnitude of the projection of the vector $v(s_k)$ on the unit vector in the direction of
$v(s_l)$ is no greater than the magnitude of the vector $v(s_k)$ itself (one can reverse the roles of $s_k$ and $s_l$),
i.e.,

$$\frac{|v(s_k) \cdot v(s_l)|}{\|v(s_l)\|} \leq \|v(s_k)\|,$$

or

$$|v(s_k) \cdot v(s_l)| \leq \|v(s_k)\|\, \|v(s_l)\|.$$

If we let $\theta(v(s_k), v(s_l))$ denote the angle in radians between two non-zero vectors $v(s_k)$ and $v(s_l)$,
defined as

$$\theta(v(s_k), v(s_l)) = \arccos \frac{v(s_k) \cdot v(s_l)}{\|v(s_k)\|\, \|v(s_l)\|}, \qquad (4)$$

then the inequality is strict if and only if the angle between the vectors $v(s_k)$ and $v(s_l)$ is positive and
their magnitudes are non-zero. In fact, under the condition that $\|v(s)\| \leq 1$ for all $s \in \mathcal{S}$, which holds
in our case, we can show that the second assumption is equivalent to the following assumption.

Assumption A: There is a $t_0 > 0$ and $\kappa_0 > 0$ such that for any $(P_X, Q) \in \mathcal{C}_{hom}$, the following
holds:

1. $\sum_h P_X(h)\, p(s|h) > t_0$ for each $s \in \mathcal{S}$, and

2. $\theta(v(s_k), v(s_l)) > \kappa_0$ (note that since these $v(s)$ are component-wise non-negative, we have $\kappa_0 < \pi/2$).


The first condition says that the probability of an agent forming any evaluation $s \in \mathcal{S}$ for an object
is uniformly bounded away from zero for all generating models in the class. To get an intuition for
the second condition, consider the case where the angle between $v(s_k)$ and $v(s_l)$ is zero. One can show
that this happens only when there is a $C \in \mathbb{R}$ such that $p(s_k|h) = C\, p(s_l|h)$ for each $h \in \mathcal{H}$ such
that $P_X(h) > 0$. But in this case the evaluations $s_k$ and $s_l$ need not be distinguished at all, since they
contain the same information about $X^i$. In particular, $P(X^i = h \mid Y_j^i = s_k) = P(X^i = h \mid Y_j^i = s_l)$
for each $h \in \mathcal{H}$. Hence one can equivalently consider a generating model with fewer
possible evaluations.
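
The degenerate zero-angle case can be checked numerically. In the sketch below, the three-evaluation filter is an assumed example in which the first evaluation's column is exactly half of the second's; the helper names `v`, `angle` and `posterior` are introduced here for illustration only.

```python
import math


def v(prior, filt, s):
    """Vector v(s) with components sqrt(P_X(h)) * p(s|h)."""
    return [math.sqrt(prior[h]) * filt[h][s] for h in range(len(prior))]


def angle(prior, filt, k, l):
    """Angle in radians between v(s_k) and v(s_l)."""
    vk, vl = v(prior, filt, k), v(prior, filt, l)
    dot = sum(a * b for a, b in zip(vk, vl))
    norms = math.sqrt(sum(a * a for a in vk)) * math.sqrt(sum(b * b for b in vl))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norms)))


def posterior(prior, filt, s):
    """P(X = h | Y = s) for each type h."""
    weights = [prior[h] * filt[h][s] for h in range(len(prior))]
    total = sum(weights)
    return [w / total for w in weights]


# Assumed example: p(s_0|h) = 0.5 * p(s_1|h) for every type h.
prior = [0.5, 0.5]
filt = [[0.2, 0.4, 0.4],
        [0.1, 0.2, 0.7]]

if __name__ == "__main__":
    print(angle(prior, filt, 0, 1), angle(prior, filt, 0, 2))
    print(posterior(prior, filt, 0), posterior(prior, filt, 1))
```

The first two evaluations induce the same posterior over types, so merging them loses no information, while the third evaluation makes a strictly positive angle with both.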


Proposition 1. Assumption A and Assumption 2 in Definition 4.1 are equivalent.


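Proof. To see that Assumption A implies the second assumption, note that $\theta(v(s_k), v(s_l)) > \kappa_0$
implies that:

$$\frac{v(s_k) \cdot v(s_l)}{\|v(s_k)\|\, \|v(s_l)\|} < \cos \kappa_0.$$

Multiplying throughout by $\|v(s_k)\|\, \|v(s_l)\|$ and rearranging, we have:

$$\|v(s_k)\|\, \|v(s_l)\| - v(s_k) \cdot v(s_l) > (1 - \cos \kappa_0)\, \|v(s_k)\|\, \|v(s_l)\| > (1 - \cos \kappa_0)\, t_0^2 > 0.$$

Here in the second inequality, we use the fact that

$$\|v(s)\| = \sqrt{\sum_h P_X(h)\, p(s|h)^2} \geq \sum_h P_X(h)\, p(s|h) > t_0,$$

which follows from Jensen’s inequality. The reverse direction is less straightforward and this is
where we need to use the fact that $\|v(s)\| \leq 1$ for all $s$. First of all,

$$|v(s_k) \cdot v(s_l)| \leq \|v(s_k)\|\, \|v(s_l)\| - \delta_0$$

implies that both $\|v(s_k)\|$ and $\|v(s_l)\|$ are non-zero, since their product is at least $\delta_0 > 0$.
In particular $\|v(s_l)\| > 0$. Then dividing both sides by $\|v(s_l)\|$, we get:

$$\frac{|v(s_k) \cdot v(s_l)|}{\|v(s_l)\|} \leq \|v(s_k)\| - \frac{\delta_0}{\|v(s_l)\|} \leq \|v(s_k)\| - \delta_0,$$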

Mechanism 1: Hom-OA for a homogeneous population. Assumes that $|\mathcal{M}^i| \geq 3$.

The observations of all the people for the different objects are solicited. Let these be
denoted by $\{q_j^i\}$, where every $q_j^i \in \mathcal{S}$. A person $j$’s payment is computed as follows:

• From each population $\mathcal{M}^i$, choose any two persons $j_1$ and $j_2$ different from $j$, and for each
possible evaluation $s \in \mathcal{S}$, compute the quantity

$$f_j^i(s) = \mathbf{1}\{q_{j_1}^i = s\}\, \mathbf{1}\{q_{j_2}^i = s\}.$$

Then compute

$$\bar{f}_j(s) = \frac{1}{N} \sum_{i=1}^{N} f_j^i(s).$$

• For each evaluation $s$, fix a payment $r_j(s)$ defined as

$$r_j(s) = \begin{cases} \dfrac{K}{\sqrt{\bar{f}_j(s)}} & \text{if } \bar{f}_j(s) \neq 0, \\ 0 & \text{if } \bar{f}_j(s) = 0, \end{cases}$$

where $K > 0$ is any positive constant.

• For computing person $j$’s payment for evaluating object $o \in W_j$, choose another person $j'$
who has evaluated the same object $o$. If their reports match, i.e., if $q_j^o = q_{j'}^o = s$, then
person $j$ gets a reward of $r_j(s)$. If the reports do not match, then $j$ gets 0 payment for the
evaluation of that object.


where the last inequality holds since $\|v(s)\| \le 1$ for all $s$. In other words:

$$\|v(s_k)\|\cos\theta(v(s_k), v(s_l)) < \|v(s_k)\| - \delta_0,$$

or

$$\|v(s_k)\|\,\big(1 - \cos\theta(v(s_k), v(s_l))\big) > \delta_0.$$

Since $\|v(s_k)\| \le 1$ and $(1 - \cos\theta(v(s_k), v(s_l))) \in [0,1]$, this implies both $\|v(s_k)\| > \delta_0$ and $1 - \cos\theta(v(s_k), v(s_l)) > \delta_0$, i.e., $\theta(v(s_k), v(s_l)) > \arccos(1 - \delta_0)$. Finally we have $\sum_{h} P_X(h)\,p(s_k|h) \ge \|v(s_k)\|^2 > \delta_0^2$. Note that $\delta_0 < 1$ so that $\arccos(1 - \delta_0) < \pi/2$.

□


We present our proposed mechanism for this case in Mechanism 1, denoted as Hom-OA, where OA stands for output agreement. The mechanism has the structure of an output agreement mechanism, where a person is rewarded for evaluating an object only if his evaluation matches that of a chosen peer who has evaluated the same object. In our mechanism this reward itself depends on how "popular" the matched evaluation is overall across all the objects, where the notion of popularity is defined in a particular manner. In the following theorem, we show that the mechanism is Bayes-Nash incentive compatible for a large enough $N$.
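The payment rule of Mechanism 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the `reports` layout (object id mapped to a dictionary of person id and evaluation), the function name `hom_oa_payment`, and the deterministic choice of peers (the first ones in sorted order) are assumptions made here, whereas the mechanism itself allows any choice of peers.

```python
import math
from collections import Counter

def hom_oa_payment(reports, j, K=1.0):
    """Total payment to person j under a Hom-OA-style rule.

    reports: dict mapping object id -> dict mapping person id -> evaluation
    (hypothetical layout). Every object needs two evaluators other than j.
    """
    objects = sorted(reports)
    both_agree = Counter()  # objects on which the two chosen peers both report s
    for o in objects:
        peers = sorted(p for p in reports[o] if p != j)
        if len(peers) < 2:
            raise ValueError("need two evaluators other than j for every object")
        first, second = peers[0], peers[1]  # assumption: deterministic peer choice
        if reports[o][first] == reports[o][second]:
            both_agree[reports[o][first]] += 1

    def reward(s):
        f = both_agree[s] / len(objects)  # popularity index f_j(s)
        return K / math.sqrt(f) if f > 0 else 0.0

    total = 0.0
    for o in objects:
        if j in reports[o]:
            peer = sorted(p for p in reports[o] if p != j)[0]  # matching peer j'
            if reports[o][j] == reports[o][peer]:
                total += reward(reports[o][j])
    return total

example_reports = {
    0: {"a": "x", "b": "x", "c": "x"},
    1: {"a": "y", "b": "y", "c": "y"},
    2: {"a": "x", "b": "x", "c": "y"},
    3: {"a": "x", "b": "y", "c": "x"},
}
print(hom_oa_payment(example_reports, "a"))
```

Note that the popularity indices are computed only from the reports of persons other than `j`, which mirrors the property used in the analysis that person `j` cannot influence her own reward levels.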


Theorem 2. There exists an $N_0$ depending only on $\delta_0$ and $K$ such that if the number of objects is $N > N_0$, Mechanism 1 is strictly detail-free Bayes-Nash incentive compatible with respect to the class $\mathcal{G}_{\mathrm{hom}}$.

Proof. First, note that the computation of the payments $r_j(s_k)$ for the different $s_k$ is unaffected by the reports of person $j$. Next, suppose that everyone but person $j$ is truthful. Recalling the definition of $v(s)$, denote

$$E(f_j(s)) = \frac{1}{N}\sum_{i=1}^{N}\sum_{h \in \mathcal{H}} P_X(h)\,p(s|h)^2 = \|v(s)\|^2 = g(s).$$




In the proof of Proposition 1 we have seen that the second assumption in the definition of class $\mathcal{G}_{\mathrm{hom}}$ implies that $\|v(s)\| > \delta_0$, and thus we have $g(s) > \delta_0^2 > 0$ for all $s \in \mathcal{S}$. Next, recall that

$$r_j(s) = \mathbf{1}\{f_j(s) \ne 0\}\,\frac{K}{\sqrt{f_j(s)}}.$$

We will show first that

$$E(r_j(s)) = \frac{K}{\sqrt{g(s)}} + m(N),$$

where there exists a function of $N$, $\sigma(N) > 0$, that depends only on $K$ and $\delta_0$, i.e., it is independent of the generating model, such that $|m(N)| \le \sigma(N)$ and $\sigma(N) = o(1)$, i.e., $\lim_{N\to\infty}\sigma(N) = 0$. To show this, we first have for any $\epsilon > 0$:


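$$\begin{aligned}
E(r_j(s)) &\ge P\big(f_j(s) \in [g(s)(1-\epsilon),\, g(s)(1+\epsilon)]\big)\,\frac{K}{\sqrt{g(s)(1+\epsilon)}} \\
&\ge \big(1 - 2\exp(-\epsilon^2 g(s)^2 N)\big)\,\frac{K}{\sqrt{g(s)(1+\epsilon)}} \\
&\ge \big(1 - 2\exp(-\epsilon^2 \delta_0^4 N)\big)\,\frac{K}{\sqrt{g(s)(1+\epsilon)}} \\
&\ge \frac{K}{\sqrt{g(s)(1+\epsilon)}} - 2\exp(-\epsilon^2 \delta_0^4 N)\,\frac{K}{\delta_0\sqrt{1+\epsilon}} \\
&\ge \frac{K}{\sqrt{g(s)}}\Big(1 - \frac{\epsilon}{2} + h(\epsilon)\Big) - 2\exp(-\epsilon^2 \delta_0^4 N)\,\frac{K}{\delta_0\sqrt{1+\epsilon}}.
\end{aligned}$$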

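Here the second inequality follows from Hoeffding's inequality, and the fifth is the Taylor series approximation of the function $1/\sqrt{1+\epsilon}$, where $h(\epsilon) = o(\epsilon)$. The other inequalities result from the fact that $g(s) \ge \delta_0^2$. Thus we have

$$E(r_j(s)) \ge \frac{K}{\sqrt{g(s)}} - \frac{\epsilon K}{2\delta_0} - \frac{K|h(\epsilon)|}{\delta_0} - 2\exp(-\epsilon^2 \delta_0^4 N)\,\frac{K}{\delta_0\sqrt{1+\epsilon}}.$$

Taking $\epsilon = N^{-a}$ for a fixed $a \in (0, 1/2)$, so that $\epsilon \to 0$ while $\epsilon^2 N \to \infty$, we have:

$$E(r_j(s)) \ge \frac{K}{\sqrt{g(s)}} - l(N), \qquad (5)$$

where $l(N) > 0$ and, as a function of $N$, it depends only on $\delta_0$ and $K$, and further $\lim_{N\to\infty} l(N) = 0$.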
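Next we also have:

$$\begin{aligned}
E(r_j(s)) &\le P\big(f_j(s) \in [g(s)(1-\epsilon),\, g(s)(1+\epsilon)]\big)\,\frac{K}{\sqrt{g(s)(1-\epsilon)}} + E\Big[\mathbf{1}\big\{f_j(s) \notin \{0\}\cup[g(s)(1-\epsilon),\, g(s)(1+\epsilon)]\big\}\,\frac{K}{\sqrt{f_j(s)}}\Big] \\
&\le \frac{K}{\sqrt{g(s)(1-\epsilon)}} + E\Big[\mathbf{1}\big\{f_j(s) \notin \{0\}\cup[g(s)(1-\epsilon),\, g(s)(1+\epsilon)]\big\}\Big]\,KN \\
&\le \frac{K}{\sqrt{g(s)(1-\epsilon)}} + 2KN\exp(-\epsilon^2 g(s)^2 N) \\
&\le \frac{K}{\sqrt{g(s)(1-\epsilon)}} + 2KN\exp(-\epsilon^2 \delta_0^4 N) \\
&\le \frac{K}{\sqrt{g(s)}}\Big(1 + \frac{\epsilon}{2} + w(\epsilon)\Big) + 2KN\exp(-\epsilon^2 \delta_0^4 N) \\
&\le \frac{K}{\sqrt{g(s)}} + \frac{\epsilon K}{2\delta_0} + \frac{K|w(\epsilon)|}{\delta_0} + 2KN\exp(-\epsilon^2 \delta_0^4 N).
\end{aligned}$$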


Here the second inequality results from the fact that on the event $\{f_j(s) \ne 0\}$, $f_j(s) \ge 1/N$. This is because $f_j(s)$ only takes values in the set $\{0, 1/N, 2/N, \ldots, 1\}$. The third inequality follows from Hoeffding's inequality, and the fifth follows from the Taylor approximation of the function $1/\sqrt{1-\epsilon}$, where $w(\epsilon) = o(\epsilon)$. Now choosing $\epsilon = N^{-a}$ for a fixed $a \in (0, 1/2)$, so that $\epsilon \to 0$ and $N\exp(-\epsilon^2\delta_0^4 N) \to 0$, we get:

$$E(r_j(s)) \le \frac{K}{\sqrt{g(s)}} + u(N), \qquad (6)$$

where $u(N) > 0$ and, as a function of $N$, it depends only on $\delta_0$ and $K$. Further $\lim_{N\to\infty} u(N) = 0$. Hence defining $\sigma(N) = \max(u(N), l(N))$ we have $|m(N)| \le \sigma(N)$, where $\sigma(N) = o(1)$ and it depends only on $\delta_0$ and $K$.

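The expected reward of person $j$ for evaluating object $i$, if she reports $q^i_j = s_l$ when her true evaluation is $s_k$, is

$$R(s_l, s_k) = P(Y^i_{j'} = s_l \mid Y^i_j = s_k)\,E(r_j(s_l)) = \frac{\sum_{h\in\mathcal{H}} P_X(h)\,p(s_k|h)\,p(s_l|h)}{\sum_{h\in\mathcal{H}} P_X(h)\,p(s_k|h)}\,E(r_j(s_l)).$$

Similarly, we have $R(s_k, s_k) = \dfrac{\sum_{h} P_X(h)\,p(s_k|h)^2}{\sum_{h} P_X(h)\,p(s_k|h)}\,E(r_j(s_k))$. Next, lying is strictly worse for person $j$ if $R(s_l, s_k) < R(s_k, s_k)$, that is, if

$$\frac{\sum_{h} P_X(h)\,p(s_k|h)\,p(s_l|h)}{\sum_{h} P_X(h)\,p(s_k|h)}\left(\frac{K}{\sqrt{\sum_{h} P_X(h)\,p(s_l|h)^2}} + m(N)\right) < \frac{\sum_{h} P_X(h)\,p(s_k|h)^2}{\sum_{h} P_X(h)\,p(s_k|h)}\left(\frac{K}{\sqrt{\sum_{h} P_X(h)\,p(s_k|h)^2}} + m(N)\right),$$

where $m(N)$ denotes the respective error term on each side. Multiplying both sides by $\sum_h P_X(h)\,p(s_k|h)\,\sqrt{\sum_h P_X(h)\,p(s_l|h)^2}/K$, and noting that each error term is at most $\sigma(N)$ in magnitude and is multiplied by a factor of at most 1, this holds whenever

$$\sum_{h} P_X(h)\,p(s_k|h)\,p(s_l|h) < \sqrt{\sum_{h} P_X(h)\,p(s_k|h)^2}\,\sqrt{\sum_{h} P_X(h)\,p(s_l|h)^2} - \frac{2\sigma(N)}{K}.$$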
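Further, for every generating model in class $\mathcal{G}_{\mathrm{hom}}$, we have

$$\sum_{h\in\mathcal{H}} P_X(h)\,p(s_k|h)\,p(s_l|h) + \delta_0 < \sqrt{\sum_{h\in\mathcal{H}} P_X(h)\,p(s_k|h)^2}\,\sqrt{\sum_{h\in\mathcal{H}} P_X(h)\,p(s_l|h)^2},$$

where $\delta_0 > 0$. Thus truth-telling gives a strictly better payoff if the contribution of the error terms, which is at most $2\sigma(N)/K$ in magnitude, is smaller than $\delta_0$. Since $|m(N)| \le \sigma(N)$, which depends only on $\delta_0$ and $K$, and $\sigma(N) = o(1)$, there is an $N_0$ depending only on $\delta_0$ and $K$ such that for all $N > N_0$,

$$\frac{2\sigma(N)}{K} < \delta_0$$

irrespective of the generating model in $\mathcal{G}_{\mathrm{hom}}$.

□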


Note how detail-freeness follows from the fact that the proof does not depend on the specific filter $\{p(s|h)\}$, but depends on a universal property shared by any such filter, i.e., the Cauchy-Schwarz inequality.


4.1 Remarks: 

An alternative to the peer prediction method: In the case where the mechanism designer knows the underlying generating model $P_X$ and $\{p(s|h)\}$, the mechanism can compute the rewards for each evaluation directly, without having to estimate statistics from evaluations for multiple objects. In order to do so, for each evaluation $s_k$, one defines

$$g(s_k) = \sum_{h \in \mathcal{H}} P_X(h)\,p(s_k|h)^2,$$

and defines payments $r(s_k)$ for the different evaluations as $r(s_k) = K/\sqrt{g(s_k)}$ (in case $g(s_k) = 0$ for some $s_k$, then one can simply not allow that signal to be reported). In this case, our mechanism provides an alternative to the peer prediction method of [MRZ05], while using the simple structure of output agreement mechanisms and without using proper scoring rules.
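For a known generating model, these rewards and the resulting expected payoffs can be computed exactly. The following Python sketch does so for a small hypothetical model; the function names, the list-based encoding (`prior[h]` for $P_X(h)$ and `filt[h][s]` for $p(s|h)$) and the numerical values are assumptions chosen only for illustration.

```python
import math

def known_model_rewards(prior, filt, K=1.0):
    """Return r(s) = K / sqrt(g(s)), with g(s) = sum_h P_X(h) p(s|h)^2."""
    types = range(len(prior))
    rewards = []
    for s in range(len(filt[0])):
        g = sum(prior[h] * filt[h][s] ** 2 for h in types)
        rewards.append(K / math.sqrt(g) if g > 0 else 0.0)
    return rewards

def expected_reward(prior, filt, rewards, true_s, reported_s):
    """R(reported, true) = P(peer reports `reported` | own evaluation is `true`) * r(reported)."""
    types = range(len(prior))
    joint = sum(prior[h] * filt[h][true_s] * filt[h][reported_s] for h in types)
    marginal = sum(prior[h] * filt[h][true_s] for h in types)
    return joint / marginal * rewards[reported_s]

prior = [0.5, 0.5]                # hypothetical P_X over two types
filt = [[0.8, 0.2], [0.3, 0.7]]   # hypothetical p(s|h); rows are types
rewards = known_model_rewards(prior, filt)
for true_s in range(2):
    for reported_s in range(2):
        print(true_s, reported_s, round(expected_reward(prior, filt, rewards, true_s, reported_s), 4))
```

In this example the truthful report yields the larger expected reward for both evaluations, as the Cauchy-Schwarz argument predicts whenever the two vectors $v(s_k)$ and $v(s_l)$ are not parallel.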


Relaxing the requirement that $|M^i| \ge 3$: The assumption $|M^i| \ge 3$ is needed to ensure that for any object, there are at least two persons other than any given person $j$ who have evaluated the object, which in turn ensures that the computation of the $f_j(s)$ values is unaffected by the reports of $j$. In practice, even if one computes the $f_j(s)$ values by randomly selecting any two persons for each object, as long as $|W_j|$ is small and $N$ is large, these values will not be affected much by the reports of $j$. In this case, one can drop the subscript $j$, and use the same values $f(s)$ to compute everyone's payment.


Truthfulness gives higher payoff than random sampling: Truthful reporting is not the only equilibrium of this mechanism. One class of equilibria is where each person in the population reports evaluations sampled independently from the same distribution, say $\{z(s) : s \in \mathcal{S}\}$, independent of their observations. But one can easily show that the truthful equilibrium gives more reward in expectation to each person than any of the equilibria in this class. In such an equilibrium, we have $E(r_j(s)) = K/z(s) + m(N)$ for all $s$ such that $z(s) > 0$, where $m(N) = o(1)$. Thus the expected payment of each agent for evaluation of one object is $\sum_{s\in\mathcal{S}} z(s)^2\,(K/z(s) + m(N)) = K + r(N)$, where $r(N) = o(1)$, whereas the expected payment for evaluation of one object in the truthful equilibrium is

$$\begin{aligned}
\sum_{s\in\mathcal{S}} P(Y_j = s)\,P(Y_{j'} = s \mid Y_j = s)\,E(r_j(s)) &= \sum_{s\in\mathcal{S}} \frac{K\sum_{h\in\mathcal{H}} P_X(h)\,p(s|h)^2}{\sqrt{\sum_{h\in\mathcal{H}} P_X(h)\,p(s|h)^2}} + q(N) \\
&= K\sum_{s\in\mathcal{S}} \sqrt{\sum_{h\in\mathcal{H}} P_X(h)\,p(s|h)^2} + q(N) \\
&\ge K\sum_{s\in\mathcal{S}} P(Y_j = s) + q(N) = K + q(N),
\end{aligned}$$

where $q(N) = o(1)$. The last inequality follows from Jensen's inequality. In fact, this inequality is strict for the class $\mathcal{G}_{\mathrm{hom}}$ with a gap that is universally bounded away from zero. To see this, note that the inequality is not strict only when $Y_j$ and $X$ are independent, i.e., the population filter is such that the evaluations are independent of the type of the object. But if that is the case, one can verify that the Cauchy-Schwarz gap required by the second assumption of the class is zero, thus violating our assumption. Thus for a large enough $N$, the truthful equilibrium gives a strictly higher expected payoff.


Conjecture - Truthfulness gives higher payoff than any symmetric equilibrium: We conjecture that for a large enough $N$, truthful reporting gives at least as much payoff to each individual as in any symmetric equilibrium, where each person in the population maps the observed evaluation $s$ for any object to a reported evaluation $s'$ with some probability $q(s'|s)$ for each $s, s' \in \mathcal{S}$. In fact, we conjecture that under the conditions satisfied by $\mathcal{G}_{\mathrm{hom}}$, the payoff is strictly higher compared to all symmetric equilibria except the ones that result from relabelling the signals. Any symmetric equilibrium is equivalent to the truthful equilibrium in which the population filter is given by:

$$p'(s|h) = \sum_{s' \in \mathcal{S}} p(s'|h)\,q(s|s').$$

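The expected payment of each agent for evaluation of one object in the truthful equilibrium is

$$\sum_{s\in\mathcal{S}} P(Y_j = s)\,P(Y_{j'} = s \mid Y_j = s)\,E(r_j(s)) = \sum_{s\in\mathcal{S}} \frac{K\sum_{h} P_X(h)\,p(s|h)^2}{\sqrt{\sum_{h} P_X(h)\,p(s|h)^2}} + q(N) = K\sum_{s\in\mathcal{S}}\sqrt{\sum_{h\in\mathcal{H}} P_X(h)\,p(s|h)^2} + q(N),$$

where $q(N) = o(1)$. Similarly, the expected payoff in any symmetric equilibrium is:

$$K\sum_{s\in\mathcal{S}}\sqrt{\sum_{h\in\mathcal{H}} P_X(h)\,p'(s|h)^2} + r(N) = K\sum_{s\in\mathcal{S}}\sqrt{\sum_{h\in\mathcal{H}} P_X(h)\Big(\sum_{s'\in\mathcal{S}} p(s'|h)\,q(s|s')\Big)^2} + r(N).$$

We conjecture that the following inequality holds in general, and strictly under the assumptions of the class $\mathcal{G}_{\mathrm{hom}}$:

$$\sum_{s\in\mathcal{S}}\sqrt{\sum_{h\in\mathcal{H}} P_X(h)\,p(s|h)^2} \;\ge\; \sum_{s\in\mathcal{S}}\sqrt{\sum_{h\in\mathcal{H}} P_X(h)\Big(\sum_{s'\in\mathcal{S}} p(s'|h)\,q(s|s')\Big)^2}. \qquad (7)$$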

We have not been able to prove or disprove it thus far. The inequality has the following interpretation. Consider the following definition.

Definition 4.2. Consider two random variables $Y_1$ and $Y_2$, taking values in a finite set $\mathcal{S}$, such that they are conditionally independent and identically distributed given some random variable $X$ taking values in a finite set $\mathcal{H}$. Then the agreement measure between $Y_1$ and $Y_2$ is defined as

$$\Gamma(Y_1, Y_2) = \sum_{s\in\mathcal{S}} \sqrt{P(Y_1 = Y_2 = s)}.$$

^To see this, observe that for any $s_k$ and $s_l$, the angle between the two vectors $v(s_k) = P(Y_j = s_k)\,[\sqrt{P_X(h_1)}, \ldots, \sqrt{P_X(h_L)}]$ and $v(s_l) = P(Y_j = s_l)\,[\sqrt{P_X(h_1)}, \ldots, \sqrt{P_X(h_L)}]$ is 0, and hence the Cauchy-Schwarz inequality is not strict.
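Now if $X$ has distribution $P_X$, and the conditional distributions of $Y_1$ and $Y_2$ given $X$ are denoted as $\{p(s|h)\}$, then the agreement measure is $\sum_{s\in\mathcal{S}}\sqrt{\sum_{h\in\mathcal{H}} P_X(h)\,p(s|h)^2}$, and this is the expected payoff to an individual under the truthful equilibrium in our mechanism (up to the factor $K$ and an $o(1)$ term). The agreement measure has the following properties:

1. $\Gamma(Y_1, Y_2) \ge 1$. To see this, note that Jensen's inequality implies that

$$\sum_{s\in\mathcal{S}}\sqrt{\sum_{h\in\mathcal{H}} P_X(h)\,p(s|h)^2} \;\ge\; \sum_{s\in\mathcal{S}}\sum_{h\in\mathcal{H}} P_X(h)\,p(s|h) = 1.$$

In fact $\Gamma(Y_1, Y_2) = 1$ only when $Y_1$ and $Y_2$ are independent.

2. $\Gamma(Y_1, Y_2) \le \sqrt{|\mathcal{S}|}$. To see this, note that Jensen's inequality implies that

$$\sum_{s\in\mathcal{S}}\sqrt{\sum_{h\in\mathcal{H}} P_X(h)\,p(s|h)^2} \;\le\; |\mathcal{S}|\sqrt{\frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}}\sum_{h\in\mathcal{H}} P_X(h)\,p(s|h)^2} \;\le\; |\mathcal{S}|\sqrt{\frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}}\sum_{h\in\mathcal{H}} P_X(h)\,p(s|h)} = \sqrt{|\mathcal{S}|}.$$

In fact $\Gamma(Y_1, Y_2) = \sqrt{|\mathcal{S}|}$ only when $Y_1$ and $Y_2$ are identical and they are distributed uniformly, i.e., $Y_1 = Y_2$ and $P(Y_1 = s) = 1/|\mathcal{S}|$.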
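The two bounds on the agreement measure are easy to check numerically. The sketch below is illustrative only: the function name `agreement_measure`, the list-based encoding of $P_X$ and $\{p(s|h)\}$, and the three example models are assumptions, not part of the paper.

```python
import math

def agreement_measure(prior, filt):
    """Sum over s of sqrt( sum_h P_X(h) p(s|h)^2 ) for conditionally i.i.d. Y_1, Y_2."""
    types = range(len(prior))
    return sum(
        math.sqrt(sum(prior[h] * filt[h][s] ** 2 for h in types))
        for s in range(len(filt[0]))
    )

uninformative = agreement_measure([0.5, 0.5], [[0.3, 0.7], [0.3, 0.7]])  # Y independent of X
noisy = agreement_measure([0.5, 0.5], [[0.8, 0.2], [0.3, 0.7]])
perfect = agreement_measure([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])        # Y_1 = Y_2, uniform
print(uninformative, noisy, perfect, math.sqrt(2))
```

The uninformative model attains the lower bound 1, the perfectly informative uniform model attains the upper bound $\sqrt{|\mathcal{S}|} = \sqrt{2}$, and the noisy model lies strictly in between.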

Now our conjecture is true if the agreement measure has the following property. Suppose $Z_1$ and $Z_2$ are two random variables such that 1) $Z_1$ and $X$ are conditionally independent given $Y_1$, and $Z_2$ and $X$ are conditionally independent given $Y_2$, and 2) $Z_1$ and $Z_2$ have the same conditional distributions given $Y_1$ and $Y_2$ respectively. Then clearly $Z_1$ and $Z_2$ are conditionally independent and identically distributed given $X$. Then we would like to show that

$$\Gamma(Y_1, Y_2) \ge \Gamma(Z_1, Z_2).$$

That is, intuitively, the agreement measure decreases if some information in $Y_1$ and $Y_2$ is unilaterally lost.


5 Heterogeneous population 

We now consider the case of a heterogeneous population. In this case it is known (see [RF15]) that unless additional structural assumptions are made on the class of generating models, it can be impossible to design a strictly detail-free Bayes-Nash incentive compatible mechanism. To get an intuition for the key issue, consider the following example. Assume a movie is being evaluated. Suppose that its type is in the set $\mathcal{H}$ = {Action, Drama} and that the set of evaluations is $\mathcal{S}$ = {Good, Bad}. Consider two filters: an 'action-lover' filter and a 'drama-lover' filter, as shown in Figure 1. Now consider a generating model such that both these filters are in the support of $Q$ and consider an agent, say Bob, who has an action-lover filter and another agent, Alice, who has a drama-lover filter. Now observe that the filters are such that the conditional distribution over the evaluations of all the other agents in the population from the point of view of Bob, given that he evaluates the movie to be good, is the same as the conditional distribution over the evaluations of all the other agents in the population from the point of view of Alice, given that she evaluates the movie to be bad. So any strictly detail-free Bayes-Nash incentive compatible mechanism that strictly incentivizes Bob to report 'Good' when his evaluation is 'Good' also simultaneously incentivizes Alice to report 'Good' when her evaluation for the movie is 'Bad'. Hence there can be no mechanism that is strictly Bayes-Nash incentive compatible with respect to this generating model.

Figure 1: Two filters. Panels: action-lover filter and drama-lover filter, each over the types Action and Drama.

5.1 Regular filters 

The previous result intuitively implies that if the preferences of the agents in the population vary too much, then designing a knowledge-independent strictly truthful mechanism is impossible. One thus needs to impose some uniformity or regularity on the population filters for there to be any hope of designing such mechanisms. With this in mind, we look at the specific case of binary evaluations, i.e., $|\mathcal{S}| = 2$, and impose a notion of regularity on the filters.

Definition 5.1. We say that a generating model $(P_X, Q)$ with $|\mathcal{S}| = 2$ is $\delta$-regular for some $\delta > 0$ if there is a fixed ordering of types in the set $\mathcal{H}$, say,

$$h_1 \succ h_2 \succ \cdots \succ h_L,$$

such that every filter $\{p_j(s|h)\}$ in the support of $Q$ satisfies the following regularity condition with respect to this ordering:

$$p_j(s_1|h) > p_j(s_1|h') + \delta \quad \text{if } h \succ h' \qquad (8)$$

for each $h \ne h' \in \mathcal{H}$. Note that this implies that $p_j(s_2|h) < p_j(s_2|h') - \delta$ if $h \succ h'$.

It follows from the definition that if a generating model is $\delta$-regular, then it is also $\delta'$-regular for any $\delta' \in (0, \delta]$. Intuitively, the regularity condition implies that the types that are higher in the ordering are in some sense "closer" to $s_1$ and those lower in the ordering are "farther away" from $s_1$ (and vice versa for $s_2$), so that an agent making a particular evaluation is more likely if the object type is "closer" to that evaluation. Note that for any particular filter $p_j(\cdot|\cdot)$, as long as $p_j(s_1|h) \ne p_j(s_1|h')$ for any $h, h'$, one can always define an ordering of the types such that this condition will be satisfied for that filter (if this is not satisfied for some $h$ and $h'$, then $h$ and $h'$ can be clustered to form a single type). But the regularity condition says that there is one such fixed ordering of types for all filters in the support of $Q$.
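Whether a finite collection of binary filters admits a common ordering can be checked directly. The sketch below is a hypothetical helper (the name `regularity_margin` and the encoding of each filter as the list of values $p_j(s_1|h)$ over types are assumptions): it orders the types by the first filter and returns the smallest gap between consecutive types across all filters.

```python
def regularity_margin(filters):
    """filters[j][h] = p_j(s_1 | h) for agent filter j and type h (binary evaluations).

    Returns (margin, ordering): the filters are delta-regular for every
    0 < delta < margin under `ordering` (type indices from highest to lowest).
    Returns (0.0, None) when no common strict ordering exists.
    """
    n_types = len(filters[0])
    # Any common strict ordering must agree with the ordering induced by the first filter.
    ordering = sorted(range(n_types), key=lambda h: -filters[0][h])
    margin = min(
        p[ordering[i]] - p[ordering[i + 1]]
        for p in filters
        for i in range(n_types - 1)
    )
    return (margin, ordering) if margin > 0 else (0.0, None)

print(regularity_margin([[0.9, 0.5, 0.2], [0.8, 0.6, 0.1]]))  # a common ordering exists
print(regularity_margin([[0.9, 0.2], [0.2, 0.9]]))            # opposite preferences
```

Checking only consecutive types suffices, since the gap between any two types in the ordering is a sum of consecutive gaps. The second call mimics the action-lover and drama-lover situation, where the two filters order the types in opposite ways and no common ordering exists.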

Definition 5.2. For a generating model $(P_X, Q)$, the ensemble filter is defined as

$$\bar{p}(s|h) = E_Q(p_j(s|h)),$$

where the expectation is with respect to $Q$.

Clearly, if a generating model is $\delta$-regular, then the ensemble filter satisfies the regularity condition as well.

Let us take a closer look at the $\delta$-regularity assumption with the example of peer grading. Consider the case where $|\mathcal{H}| = |\mathcal{S}| = 2$. In this case, w.l.o.g., we can assume that $\mathcal{H} = \mathcal{S} = \{h_1, h_2\}$. Then the $\delta$-regularity condition will be satisfied if either:

• $p_j(h_1|h_1) > p_j(h_1|h_2) + \delta$ (and thus $p_j(h_2|h_2) > p_j(h_2|h_1) + \delta$) for all filters in the support of $Q$.

• $p_j(h_1|h_2) > p_j(h_1|h_1) + \delta$ (and thus $p_j(h_2|h_1) > p_j(h_2|h_2) + \delta$) for all filters in the support of $Q$.

In the context of peer grading, the first condition basically says that a person judging that a homework submission deserves a grade A is more likely if the true grade is A (in fact, more than $\delta$ so) than if the true grade is something else, which is quite intuitive and can be safely assumed to hold. Note that this is different from saying that

$$p_j(h_1|h_1) > p_j(h_2|h_1) + \delta,$$

which says, for example, that if the true grade is A, then it is more likely that a person thinks it is A than that it is something else. In fact one can show that this condition implies regularity (and hence is stronger than regularity) if $|\mathcal{H}| = |\mathcal{S}| = 2$.

Also, in the case of peer grading, the true type of the answer may be drawn from a rich space 
of features like we have discussed earlier, e.g., handwriting, presentation etc. In such settings, the 
regularity assumption may be violated with respect to this entire space of types, since different 
students may be biased towards considering different features of an answer more important than 
others. For example, one student may value presentation of the answer more than elegance of the solution or handwriting, while others may have different biases. However, in practice, each submission is typically evaluated separately with respect to each of these features. It is reasonable to assume that the regularity assumption is well justified with respect to each of these individual features. Now we will consider the class of generating models $\mathcal{G}_{\mathrm{reg}}$.

Definition 5.3. Suppose that $|\mathcal{S}| = 2$. $\mathcal{G}_{\mathrm{reg}}$ is defined to be the class of generating models $(P_X, Q)$ that satisfy the following set of assumptions.

1. Each generating model $(P_X, Q) \in \mathcal{G}_{\mathrm{reg}}$ is $\delta_0$-regular for some $\delta_0 > 0$.

2. There is an $\epsilon_0 > 0$ such that $P_X(h) > \epsilon_0$ for each $h \in \mathcal{H}$ for all generating models in $\mathcal{G}_{\mathrm{reg}}$.

The proposed mechanism for this class is presented in Mechanism 2 (Het-OA). We then have the following result.

Theorem 3. Consider a sequence of populations $\{M(N)\}$ indexed by the number of objects $N$ that satisfies $|M(N)^i| \ge 2$ and $|W(N)_j| \le C$, where $C$ is finite. Then there is an $N_0$ depending only on $\delta_0$, $\epsilon_0$ and $K$ such that for all $N > N_0$, Mechanism 2 is strictly detail-free Bayes-Nash incentive compatible with respect to the class $\mathcal{G}_{\mathrm{reg}}$.

^In fact, there are many situations where this stronger condition may not hold. A popular example is that many people do not know that the capital of the US state of Illinois is Springfield, and not Chicago, as is commonly assumed. That is, given that the capital is Springfield, it is not true that a majority of the population thinks that it is Springfield. But it is reasonable to assume that more people think that the capital is Springfield if the capital was actually Springfield (which it is) than if the capital was something else, say Chicago. This is precisely the regularity assumption.

Mechanism 2: Het-OA: Heterogeneous population with binary evaluations and regular filters. Assumes that $|M^i| \ge 2$.

The observations of all the people for the different objects are solicited. Let these be denoted by $\{q^i_j\}$, where every $q^i_j \in \mathcal{S}$. A person $j$'s payment is computed as follows:

• Form the largest set possible of distinct persons different from $j$ such that each person has evaluated a different object from everyone else. Denote this set by $U_j$ and the set of different objects evaluated by this set by $A_j$. For each $i$ in $A_j$, let $j(i)$ denote the person in $U_j$ that has evaluated $i$. Next, for each of the two evaluations $s_1$ and $s_2$, and for $i \in A_j$, define

$$f^i_j(s) = \mathbf{1}\{q^i_{j(i)} = s\}.$$

Then compute

$$f_j(s) = \frac{1}{|A_j|}\sum_{i \in A_j} f^i_j(s).$$

• For the evaluations $s_1$ and $s_2$, fix payments $r_j(s_1)$ and $r_j(s_2)$ defined as

$$r_j(s) = \begin{cases} \dfrac{K}{f_j(s)} & \text{if } f_j(s) \ne 0, \\ 0 & \text{if } f_j(s) = 0, \end{cases}$$

where $K > 0$ is any positive constant.

• For computing person $j$'s payment for evaluating object $o \in W_j$, choose another person $j'$ who has evaluated the same object $o$. If their reports match, i.e., if $q^o_j = q^o_{j'} = s$, then person $j$ gets a reward of $r_j(s)$. If the reports do not match, then $j$ gets 0 payment for the evaluation of that object.


The high-level idea of the proof is the following. First, $|W(N)_j| \le C$, where $C$ is finite, means that each person evaluates only a finite number of objects. This, together with the fact that $|M(N)^i| \ge 2$, i.e., each object is evaluated by somebody other than $j$, means that there are at least $\lfloor N/C \rfloor$ evaluations of distinct objects that are made by distinct persons, and thus $|A_j| \ge \lfloor N/C \rfloor - 1 = \Omega(N)$. Hence by the law of large numbers, for an agent $j$, $f_j(s)$ will converge to its mean as the number of objects grows, which is the probability that an arbitrarily chosen person in the population will report a signal $s$ for an arbitrary object. This probability is given by

$$P(Y = s) = \sum_{h\in\mathcal{H}} P_X(h)\,E(p_j(s|h)) = \sum_{h\in\mathcal{H}} P_X(h)\,\bar{p}(s|h) \ge \epsilon_0\delta_0.$$

By arguments similar to those used in Theorem 1, one can show that $E(r_j(s))$ converges to $K/P(Y = s)$. Now suppose that an agent makes the evaluation $s_1$. Then our assumptions on the generating model, in particular the $\delta_0$-regularity condition, ensure that $P(Y^i_{j'} = s_1 \mid Y^i_j = s_1) - P(Y = s_1) > 0$, and hence $P(Y^i_{j'} = s_2 \mid Y^i_j = s_1) - P(Y = s_2) < 0$. In words, conditional on making a particular evaluation, the probability that a randomly picked agent who has evaluated the same object makes the same evaluation increases relative to the unconditional probability of an arbitrary agent making that evaluation for an arbitrary object. Thus

$$K\,\frac{P(Y^i_{j'} = s_1 \mid Y^i_j = s_1)}{P(Y = s_1)} > K\,\frac{P(Y^i_{j'} = s_2 \mid Y^i_j = s_1)}{P(Y = s_2)}.$$

For a large $N$, the LHS is approximately the expected payment the agent receives if she reports $s_1$ and the RHS is approximately the expected payment if she reports $s_2$ (with the approximation error decaying in $N$). Further, the regularity assumption implies that the gap in this inequality is bounded away from zero (this gap depends on $\delta_0$ and $\epsilon_0$). Hence the mechanism is truthful for a large enough $N$.
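A minimal Python sketch of the Het-OA payment computation follows. It makes several assumptions for illustration: the `reports` layout and function names are hypothetical, the matching peer is the first other evaluator in sorted order, and the set $U_j$ is built greedily, which gives a valid set of distinct persons on distinct objects but not necessarily the largest one that the mechanism asks for.

```python
def het_oa_rewards(reports, j, K=1.0):
    """Reward levels r_j(s) = K / f_j(s) from one report per object by distinct peers."""
    used = set()
    picked = []  # one evaluation per object, each from a different person other than j
    for o in sorted(reports):
        for p in sorted(reports[o]):
            if p != j and p not in used:  # greedy choice (assumption, may not be maximal)
                used.add(p)
                picked.append(reports[o][p])
                break
    if not picked:
        return {}
    return {s: K / (picked.count(s) / len(picked)) for s in set(picked)}

def het_oa_payment(reports, j, K=1.0):
    """Total payment to j: reward r_j(s) on each object where j matches a peer on s."""
    rewards = het_oa_rewards(reports, j, K)
    total = 0.0
    for o in sorted(reports):
        if j in reports[o]:
            peers = sorted(p for p in reports[o] if p != j)
            if peers and reports[o][j] == reports[o][peers[0]]:
                total += rewards.get(reports[o][j], 0.0)
    return total

example = {
    0: {"a": "s1", "b": "s1"},
    1: {"a": "s2", "c": "s2"},
    2: {"d": "s1", "e": "s1"},
    3: {"a": "s1", "f": "s2"},
    4: {"g": "s1", "h": "s1"},
}
print(het_oa_rewards(example, "a"), het_oa_payment(example, "a"))
```

In this example the less frequent evaluation `s2` carries the larger reward, which is the sense in which the reward is inversely proportional to the popularity of the matched evaluation.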


[RFJ16] propose a similar output agreement mechanism where the reward for matching is scaled by the prior probability of the evaluation, which is shown to be truthful in the non-binary setting assuming the condition:

$$\arg\max_{s' \in \mathcal{S}} \frac{P(Y^i_{j'} = s' \mid Y^i_j = s)}{P(Y = s')} = s.$$

Clearly, under this condition, our mechanism is also truthful in the non-binary setting. But in the non-binary setting, it is unclear if this condition is true under reasonable assumptions, unlike regularity in the binary setting. In fact, it is not even clear if this condition is true under reasonable assumptions in the homogeneous population setting.


Proof of Theorem 3. By arguments similar to the ones presented in the proof of Theorem 1, we can show that

$$E(r_j(s_k)) = \frac{K}{P(Y = s_k)} + m(N)$$

for $k = 1, 2$, where $|m(N)| \le \sigma(N) = o(1)$, and where $\sigma(N)$ depends only on $\delta_0$, $\epsilon_0$ and $K$, and not on the particular choice of generating model in $\mathcal{G}_{\mathrm{reg}}$. Next we have

$$P(X^i = h_l \mid Y^i_j = s_1) = \frac{P_X(h_l)\,p_j(s_1|h_l)}{\sum_{l'} P_X(h_{l'})\,p_j(s_1|h_{l'})} = P_X(h_l)\,t(s_1, h_l),$$

where we have defined $t(s_1, h_l) \triangleq \dfrac{p_j(s_1|h_l)}{\sum_{l'} P_X(h_{l'})\,p_j(s_1|h_{l'})}$. Because of strict regularity of the filters, we have that

$$t(s_1, h_1) > t(s_1, h_2) > \cdots > t(s_1, h_L)$$

and further $t(s_1, h_1) > 1$. Now from the point of view of person $j$, the distribution of the report of a randomly chosen person $j'$ who has evaluated the same object is

$$P(Y^i_{j'} = s_1 \mid Y^i_j = s_1) = \sum_{h_l} P(X^i = h_l \mid Y^i_j = s_1)\,\bar{p}(s_1|h_l) = \sum_{h_l} P_X(h_l)\,t(s_1, h_l)\,\bar{p}(s_1|h_l),$$

and $P(Y^i_{j'} = s_2 \mid Y^i_j = s_1) = 1 - P(Y^i_{j'} = s_1 \mid Y^i_j = s_1)$. This holds because $Y^i_j$ and $Y^i_{j'}$ for different persons $j$ and $j'$ are conditionally independent given the type $X^i$. Next we have

$$P(Y^i_{j'} = s_1 \mid Y^i_j = s_1) - P(Y = s_1) = \sum_{h_l} \bar{p}(s_1|h_l)\,\big(P_X(h_l)\,t(s_1, h_l) - P_X(h_l)\big).$$

We want to show that this quantity is positive and bounded away from 0, i.e., if a person has an evaluation $s_1$ for an object, then the posterior probability that another person also has the same evaluation for that object strictly increases relative to the prior. Now since $t(s_1, h_1) > t(s_1, h_2) > \cdots > t(s_1, h_L)$ and $t(s_1, h_1) > 1$, and since $\sum_{h_l} P_X(h_l)\,(t(s_1, h_l) - 1) = 0$, we cannot have $t(s_1, h_l) \ge 1$ for all $l$, and hence there must be some $l^*$ such that $t(s_1, h_l) \ge 1$ for all $l \le l^*$ while $t(s_1, h_l) < 1$ for all $l > l^*$. Then we have that

$$\sum_{h_l;\, l \le l^*} P_X(h_l)\,(t(s_1, h_l) - 1) > 0 \qquad (9)$$
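and

$$\sum_{h_l;\, l \le l^*} P_X(h_l)\,(t(s_1, h_l) - 1) = -\sum_{h_l;\, l > l^*} P_X(h_l)\,(t(s_1, h_l) - 1). \qquad (10)$$

In fact, we can prove the following lower bound:

$$\begin{aligned}
\sum_{h_l;\, l \le l^*} P_X(h_l)\,(t(s_1, h_l) - 1) &\ge P_X(h_1)\,(t(s_1, h_1) - 1) \\
&= \frac{P_X(h_1)\,p_j(s_1|h_1)}{\sum_{h_l} P_X(h_l)\,p_j(s_1|h_l)} - P_X(h_1) \\
&\ge \frac{P_X(h_1)\,p_j(s_1|h_1)}{P_X(h_1)\,p_j(s_1|h_1) + (1 - P_X(h_1))\,p_j(s_1|h_2)} - P_X(h_1) \\
&\ge \frac{P_X(h_1)\,p_j(s_1|h_1)}{P_X(h_1)\,p_j(s_1|h_1) + (1 - P_X(h_1))\,(p_j(s_1|h_1) - \delta_0)} - P_X(h_1) \\
&= \frac{P_X(h_1)\,(1 - P_X(h_1))\,\delta_0}{p_j(s_1|h_1) - (1 - P_X(h_1))\,\delta_0} \\
&\ge P_X(h_1)\,(1 - P_X(h_1))\,\delta_0 \\
&\ge \epsilon_0\,(1 - \epsilon_0)\,\delta_0. \qquad (11)
\end{aligned}$$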


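Now we have:

$$\begin{aligned}
P(Y^i_{j'} = s_1 \mid Y^i_j = s_1) - P(Y = s_1) &= \sum_{h_l;\, l \le l^*} \bar{p}(s_1|h_l)\,P_X(h_l)\,(t(s_1, h_l) - 1) + \sum_{h_l;\, l > l^*} \bar{p}(s_1|h_l)\,P_X(h_l)\,(t(s_1, h_l) - 1) \\
&\ge \bar{p}(s_1|h_{l^*}) \sum_{h_l;\, l \le l^*} P_X(h_l)\,(t(s_1, h_l) - 1) + \bar{p}(s_1|h_{l^*+1}) \sum_{h_l;\, l > l^*} P_X(h_l)\,(t(s_1, h_l) - 1) \\
&= \big(\bar{p}(s_1|h_{l^*}) - \bar{p}(s_1|h_{l^*+1})\big) \sum_{h_l;\, l \le l^*} P_X(h_l)\,(t(s_1, h_l) - 1) \\
&\ge \delta_0^2\,\epsilon_0\,(1 - \epsilon_0) =: \omega_0.
\end{aligned}$$

The first inequality follows from the fact that $P_X(h_l)\,t(s_1, h_l) - P_X(h_l) \ge 0$ ($< 0$) for $l \le l^*$ ($> l^*$) and that $\bar{p}(s_1|h_1) > \bar{p}(s_1|h_2) > \cdots > \bar{p}(s_1|h_L)$ by the strict regularity of the ensemble filter. The second equality follows from (10), and the last inequality from (11) together with the $\delta_0$-regularity of the ensemble filter. It follows that

$$P(Y^i_{j'} = s_2 \mid Y^i_j = s_1) - P(Y = s_2) \le -\omega_0.$$

Hence the expected payoff if a person $j$ reports $s_1$ when he observes $s_1$ is

$$\begin{aligned}
P(Y^i_{j'} = s_1 \mid Y^i_j = s_1)\,E(r_j(s_1)) &\ge (P(Y = s_1) + \omega_0)\,E(r_j(s_1)) \\
&\ge K + \frac{\omega_0 K}{P(Y = s_1)} - |m(N)|\,(1 + \omega_0) \\
&\ge K + \omega_0 K - \sigma(N)\,(1 + \omega_0),
\end{aligned}$$

and the expected payoff if the person reports $s_2$ instead is

$$\begin{aligned}
P(Y^i_{j'} = s_2 \mid Y^i_j = s_1)\,E(r_j(s_2)) &\le (P(Y = s_2) - \omega_0)\,E(r_j(s_2)) \\
&\le K - \frac{\omega_0 K}{P(Y = s_2)} + |m(N)|\,(1 + \omega_0) \\
&\le K - \omega_0 K + \sigma(N)\,(1 + \omega_0).
\end{aligned}$$

Thus for $N > N_0$, where $N_0$ depends only on $\epsilon_0$, $\delta_0$ and $K$, being truthful maximizes the expected payment. □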

5.2 Remarks 

A mechanism for any N: It may be tempting to think that one can achieve incentive-compatibility in expectation if, instead of using $f_j(s_k)$ as an asymptotically accurate estimate of $P(Y = s_k)$, one simply draws a person $j'$ from just one population, say $i$, and uses $f_j(s_k) = \mathbf{1}\{q^i_{j'} = s_k\}$ to compute the scores $r_j(s_k)$. But although $E(f_j(s_k)) = P(Y = s_k)$, clearly $E(1/f_j(s_k)) \ne 1/P(Y = s_k)$. Hence incentive-compatibility does not hold in general. Nevertheless, if we can depart from the format of an output agreement mechanism, then one can retrieve incentive-compatibility for any finite $N$ using the following "additive" mechanism (again assuming binary evaluations and regularity of filters).


Mechanism 3: Het-additive: Heterogeneous population with binary evaluations and regular filters.

• The observations of all the people for the different objects are solicited. Let these be denoted by $\{q^i_j\}$.

• For each person $j$ in population $M^i$, pick a person $j'$ from the same population $M^i$, and another person $j''$ from a different population $M^{i'}$. The reward of person $j$ is:

$$K\big[\mathbf{1}\{q^i_j = q^i_{j'}\} + \mathbf{1}\{q^i_j \ne q^{i'}_{j''}\}\big],$$

where $K > 0$ is an arbitrarily chosen positive constant.


To see that this mechanism is truthful, first note that the reward to person $j$ can equivalently be written as

$$K + K\big[\mathbf{1}\{q^i_j = q^i_{j'}\} - \mathbf{1}\{q^i_j = q^{i'}_{j''}\}\big].$$

Ignoring the constant reward $K$, the expected additional payment in the mechanism if she tells the truth, if she makes an evaluation $s_1$, is $K[P(Y^i_{j'} = s_1 \mid Y^i_j = s_1) - P(Y = s_1)]$. But by the regularity assumption, as mentioned above, we have shown in the proof of Theorem 3 that $P(Y^i_{j'} = s_1 \mid Y^i_j = s_1) - P(Y = s_1) > 0$, and hence $P(Y^i_{j'} = s_2 \mid Y^i_j = s_1) - P(Y = s_2) < 0$. Thus the mechanism is strictly detail-free Bayes-Nash incentive compatible. [SAFP16] define the following condition for non-binary signals in the heterogeneous population setting: $P(Y^i_{j'} = s \mid Y^i_j = s) - P(Y = s) > 0$ and $P(Y^i_{j'} = s' \mid Y^i_j = s) - P(Y = s') < 0$ for every $s' \ne s$, and they present a mechanism that is truthful under this condition. It is easy to see that if a generating model satisfies this property in our setting, then the Het-additive mechanism is strictly detail-free Bayes-Nash incentive compatible with respect to this generating model. But again it is not clear if there are practically reasonable structural assumptions on the generating model that imply this property.
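The additive reward and its expectation can be illustrated with a short Python sketch. The function names, the list-based encoding of the prior and filters, and the numbers are hypothetical; the expectation assumes that peers report truthfully according to the ensemble filter and that the type of the other object is an independent draw from the prior.

```python
def het_additive_reward(q_j, q_same, q_other, K=1.0):
    """K * (1{q_j == q_same} + 1{q_j != q_other}) for one evaluation of person j."""
    return K * (int(q_j == q_same) + int(q_j != q_other))

def expected_het_additive(prior, own, ensemble, observed, report, K=1.0):
    """Expected reward of reporting `report` after observing `observed`.

    prior[h] = P_X(h); own[h][s] = p_j(s|h); ensemble[h][s] = ensemble filter.
    """
    types = range(len(prior))
    p_obs = sum(prior[h] * own[h][observed] for h in types)
    posterior = [prior[h] * own[h][observed] / p_obs for h in types]
    p_same = sum(posterior[h] * ensemble[h][report] for h in types)  # peer on the same object
    p_other = sum(prior[h] * ensemble[h][report] for h in types)     # person on another object
    return K * (p_same + 1.0 - p_other)

prior = [0.5, 0.5]
own = [[0.8, 0.2], [0.3, 0.7]]        # hypothetical filter of person j
ensemble = [[0.7, 0.3], [0.4, 0.6]]   # hypothetical ensemble filter
truthful = expected_het_additive(prior, own, ensemble, observed=0, report=0)
lying = expected_het_additive(prior, own, ensemble, observed=0, report=1)
print(truthful, lying)
```

With these regular filters the truthful report earns more than $K$ in expectation and the untruthful report earns less than $K$, for any finite number of objects.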

Using Mechanism 2 for the homogeneous case with binary evaluations: Mechanism 2 is 
asymptotically truthful for the case of a homogeneous population for binary-choice evaluation tasks 
under the appropriate regularity assumptions. Note that in that case since everyone has the same 
filter, the 0-regularity condition is immediately satisfied. Also note that Mechanism 2 is not truthful 
in general in the homogeneous setting for non-binary evaluation tasks. 

Other equilibria: Again in this case there are other equilibria of the class where everybody reports an independent sample from some common distribution, but in expectation they give a lower payoff than the truthful equilibrium. Similar to what we saw in the homogeneous setting, asymptotically (in $N$) the expected reward is $K$ under such an equilibrium. But it is clear from the proof of Theorem 3 that if the filters are $\delta_0$-regular, the expected reward under the truthful equilibrium is asymptotically strictly greater than $K$.


6 Experiments 

Motivated by the significant emphasis on simplicity of mechanisms in the community [Rub15, Li15, Sim15], we performed experiments on the Amazon Mechanical Turk crowdsourcing platform to test
the ‘understandability’ or ‘simplicity’ of the structure of our mechanisms as compared to prior art. 
By simplicity, we roughly mean the ability of the mechanism to induce optimal behavior in agents 
who are presented with it. Indeed, “simplicity” of the mechanism bears intrinsic value and several 
applications demand an understanding of the underlying mechanism: for instance, students wish to 
know how their grades are computed, voters wish to know what electoral system is used etc. In what 
follows, we first describe the experimental setup. Alongside, we also detail the various challenges 
associated to such a study on Amazon Mechanical Turk and our approaches in overcoming these 
challenges, which may be of independent interest. Subsequently, we detail the outcomes of the 
experiment. 


6.1 Experimental setup and challenges 


There are several challenges in designing experiments to study such a characteristic of these mechanisms. Tasks on MTurk are usually objectively verifiable and include image recognition, transcription and matching tasks [RYZ+10]. As a consequence, workers may expect that their work will be verified and rewarded according to their objective accuracy [SZ15]. This feature introduces a significant bias for testing incentive compatible mechanisms, as workers have a tendency to report the truth [GMCA14]. This effect is amplified when the worker himself or herself is the “agent” in the mechanism. Moreover, in order to garner interest from workers, and thereby obtain quality data for the experiment, it is essential to make the task interesting [KSV11, BKG11].


In order to address these issues, we present the worker with a game-like setting, where he or she 
interacts with the experiment via a character (see Figure 2(a) for an example). The goal of the 
worker is to maximize the payment obtained by the character. We explicitly account for the worker’s 
bias through the introduction of an ‘inverted mechanism’ for each of our candidate mechanisms, that 
results from a trivial modification in which all the payments to the agent in the original mechanism 
are presented as penalties to the worker (over a base reward that ensures that the net payments are 
non-negative). Hence in these inverted mechanisms, truth-telling is no longer an optimal strategy: 
in the binary setting, lying is an optimal strategy. Any worker is shown either the normal or the 
inverted setting, chosen uniformly at random. We also perform an additional randomization of the 
signal (“A” or “B”) observed by the character in the experiment. The game-based setting is further 
made interesting via illustrations, examples, and simple-to-understand language for the mechanism 
descriptions. 

We compared the HET-OA mechanism of the present paper with the mechanism of [RF15] (which 
we will denote as RF). 

In more detail, the experiment was structured as follows: 

• Upon visiting the platform, workers are presented with a fictional character “Sam”, whom they 
will help. Sam’s observation (ground truth) is provided to the worker, as well as a question 
eliciting that observation. 

• In the setting, Sam is the grader for an essay in high school, and the principal of the school 
has announced a reward/penalty according to the quality of grades given by each student. 
The goal of the worker is to help Sam choose the grade (out of {A,B}) that will maximize 
Sam’s reward (or minimize Sam’s penalty). 

• Each worker is now presented with one of eight settings, sampled uniformly at random (we sample 
which mechanism to present, whether to invert it, and which private signal to show), and is 
asked whether Sam should answer the question truthfully in order to maximize his payoff, 
assuming everyone else answers truthfully. Each worker interacts with only one setting to 
minimize possible bias. 

• In addition to providing a (binary) recommendation for Sam, workers are also asked to provide 
a brief justification. 

Note that the mechanism is explained to the worker, but the worker is not told whether the mechanism 
is attempting to induce the truth or a lie. Each worker is offered a fixed payment of 30 cents 
and, contingent on giving the correct answer and a reasonable explanation, an additional bonus of 
the same amount. 
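
The randomization described in the list above can be sketched as follows. This is only an illustration, not the code used in the experiment: the label strings and the helper name `assign_setting` are hypothetical, and the sketch assumes the three choices (mechanism, inversion, signal) are combined into eight equally likely settings.

```python
import itertools
import random

# Hypothetical labels; the text only states that the mechanism, the inversion
# and the private signal are each randomized.
MECHANISMS = ("HET-OA", "RF")
INVERTED = (False, True)
SIGNALS = ("A", "B")

# 2 mechanisms x 2 (normal / inverted) x 2 signals = 8 settings.
SETTINGS = list(itertools.product(MECHANISMS, INVERTED, SIGNALS))


def assign_setting(rng):
    """Pick one of the eight settings uniformly at random for a new worker."""
    return rng.choice(SETTINGS)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the sketch reproducible
    for worker_id in range(3):
        print(worker_id, assign_setting(rng))
```

Because each worker draws a setting independently, the group sizes are not expected to be exactly equal.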

6.2 Mechanism Descriptions 

Since our goal was to gauge the ‘understandability’ of our mechanisms in the general population, 
we did not wish to restrict our experiment to agents with a background in mathematics or 
quantitative sciences. Consequently, we had to explain both mechanisms without any 
mathematical expressions beyond simple operations like addition and multiplication. We presented 
our mechanism by means of a lookup table and presented the RF mechanism (where there is no 
intuitive use for such a table) as a sequence of 3 steps of simple addition. Both mechanisms were 
accompanied by an example scenario where the payment computation steps are shown along with 
the final payment. 

In the appendix, we present the exact text of the description of the mechanisms given to the 
workers along with the belief structure of the fictional character. This allowed us to quickly categorize 
their answers as correct or incorrect for our analysis. 

The mechanism descriptions, data and a link to the actual survey are available at 
https://eecs.berkeley.edu/~nihar/data/nogold 

[Figure 2, panel (a): screenshot of the task interface, titled "The Wazoo Grading Experiment!". 
Panel (b): plot of the fraction of correct answers, with x-axis "Mechanism".] 

(a) Illustration of the interface presented to the workers. 
(b) Fraction of workers who answered correctly (that is, according to the requirements of that 
mechanism). 

Figure 2: The platform and the results from experiments on Amazon Mechanical Turk comparing the 
RF mechanism [RF15] and the HET-OA mechanism proposed in the present paper. The difference 
between the two mechanisms is statistically significant (p < 0.05). 


6.3 Results and analysis 


Figure 2(b) plots the results from the experiments, showing the fraction of workers who answered 
correctly (that is, according to what the mechanism was designed to incentivize). We obtained a 
total of 223 workers, with 114 workers being shown our mechanism and 109 workers shown the RF 
mechanism. 

In more detail, the number of workers (n), the fraction of correct responses (μ) and the 
standard error of the mean (e) associated with each sub-class of the tasks are as follows: 


Mechanism          n     μ       e 
HET-OA             61    0.803   0.051 
Inverted HET-OA    53    0.528   0.069 
RF                 60    0.500   0.065 
Inverted RF        49    0.592   0.070 
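
As a quick consistency check, the standard errors in the table agree to three decimals with the usual formula for the standard error of a Bernoulli mean, sqrt(μ(1 − μ)/n). The sketch below assumes that this is the formula behind the reported values; the function name `bernoulli_sem` is ours.

```python
import math

# (n, fraction correct, reported standard error), copied from the table above.
ROWS = {
    "HET-OA": (61, 0.803, 0.051),
    "Inverted HET-OA": (53, 0.528, 0.069),
    "RF": (60, 0.500, 0.065),
    "Inverted RF": (49, 0.592, 0.070),
}


def bernoulli_sem(n, mu):
    """Standard error of the mean of n Bernoulli(mu) samples."""
    return math.sqrt(mu * (1.0 - mu) / n)


if __name__ == "__main__":
    for name, (n, mu, reported) in ROWS.items():
        computed = bernoulli_sem(n, mu)
        print(f"{name}: computed {computed:.3f}, reported {reported:.3f}")
```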


Observe from the table that on average, the workers’ responses are more accurate under HET-OA 
than RF. We employ the standard two-sample t-test [Sta00] between the two sets of data in 
order to investigate whether this difference is statistically significant. The t-test is used 
to test whether the samples in two sets of data are drawn from underlying distributions with 
identical means (the null hypothesis) or not (the alternative hypothesis). In our case, the mean is 
simply the mean of a Bernoulli distribution that dictates the likelihood of a worker’s response being 
correct. The t-test reveals that the HET-OA mechanism elicited significantly more correct responses 
(with a p-value of 0.02) than RF, indicating that the mechanism 
proposed in this paper was more understandable to the workers than the RF mechanism. 
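
The test can be approximately reproduced from the summary table. The sketch below rests on several assumptions that the text does not confirm: the normal and inverted sub-classes are pooled per mechanism, the counts of correct answers are recovered by rounding n·μ, the equal-variance form of the t statistic is used, and the t tail is approximated by a normal tail (reasonable with over 200 degrees of freedom). All function names are ours.

```python
import math


def pooled_t_statistic(k1, n1, k2, n2):
    """Equal-variance two-sample t statistic for 0/1 data, given success counts."""
    m1, m2 = k1 / n1, k2 / n2
    # Unbiased sample variance of 0/1 data: n / (n - 1) * m * (1 - m).
    v1 = n1 / (n1 - 1) * m1 * (1 - m1)
    v2 = n2 / (n2 - 1) * m2 * (1 - m2)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))


def normal_upper_tail(z):
    """P(Z > z) for a standard normal; approximates the t tail for large df."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Assumed counts of correct answers, reconstructed by rounding n * fraction.
het_oa_correct = round(61 * 0.803) + round(53 * 0.528)
rf_correct = round(60 * 0.500) + round(49 * 0.592)

t_stat = pooled_t_statistic(het_oa_correct, 114, rf_correct, 109)
p_one_sided = normal_upper_tail(t_stat)
p_two_sided = 2 * p_one_sided

if __name__ == "__main__":
    print(f"t = {t_stat:.2f}")
    print(f"one-sided p ~ {p_one_sided:.3f}, two-sided p ~ {p_two_sided:.3f}")
```

Under these assumptions the one-sided tail is close to the reported p-value of 0.02, while the two-sided tail is roughly twice that and still below 0.05. Which variant was actually used is not stated here.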

All in all, these experiments indicate, in a statistically significant manner, that 
output-agreement-type mechanisms with popularity-scaled rewards such as those proposed in the present 
paper are more intuitive and understandable, and hence more successful in inducing truthful behavior. 

7 Conclusion 


We presented mechanisms for obtaining truthful reports with minimal elicitation. Our mechanisms 
support the setting where agents are assumed to be homogeneous, and also support heterogeneous 
workers when questions are of binary-choice format. The mechanisms rely on the existence of many 
questions, a feature commonly encountered in the settings of crowd-sourcing and peer-grading. We 
experimentally tested our mechanisms and found (p = 0.02) that they are more understandable and 
better at inducing truthful behavior than the current state of the art. Interestingly, our mechanisms 
are built under a novel framework that is a significant departure from the traditional setup of proper 
scoring rules. 

Our broader objective is to construct mechanisms that incentivize truthful reports in the absence 
of any ‘gold standard’ questions, and that are also viable in practice. The results of this paper take 
a significant step in this direction. A question left open in this manuscript is that of the general 
|S| > 2 setting with a heterogeneous population. It would be interesting to find the right notion 
of “regularity” of the agents’ filters (that is weaker than the condition in [SAFP16] or in [RFJ16]) 
that would allow one to design a strictly detail-free Bayes-Nash incentive compatible mechanism in 
that case. 

8 Appendix 

8.1 Mechanism Descriptions 

These descriptions pertain to the non-inverted case. 

HET-OA Mechanism Description Suppose that Lisa is grading the same essay as Sam. Let’s 
see what reward Sam gets. First, he gets $1 as a starting reward. Then he may also get a bonus 
reward. How much? That depends. If he and Lisa give different grades (e.g., Sam gives an A and 
Lisa gives a B), then Sam gets no bonus. However, if he and Lisa give the same grade, then he 
will get a bonus that depends on how popular the grade is overall. Higher the popularity, lower the 
bonus. 

In general, if x% of the class got an A and y% got a B (of course, x+y must add up to 100) then 
Sam’s bonus for the different possibilities is given in the table below. 



               Lisa gives A    Lisa gives B 
Sam gives A    100/x           0 
Sam gives B    0               100/y 


For example, if 40% of the grades were A and 60% B, and if Lisa and Sam both give A, then Sam 
gets a bonus of $2.5, but if both give a B, then their bonus is only $1.67 since B is more popular. 
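
The lookup table and the example above can be written as a small function. This is a minimal sketch of the description as given to the workers, not of the formal HET-OA mechanism; the function and parameter names are ours.

```python
def het_oa_reward(sam_grade, lisa_grade, pct_a, pct_b, base=1.0):
    """Sam's total reward in dollars under the worker-facing HET-OA description.

    pct_a and pct_b are the overall percentages of A and B grades (x and y).
    """
    if sam_grade != lisa_grade:
        return base  # different grades: starting reward only, no bonus
    popularity = pct_a if sam_grade == "A" else pct_b
    return base + 100.0 / popularity  # matching grades: bonus of 100 / popularity


if __name__ == "__main__":
    print(het_oa_reward("A", "A", 40, 60))  # starting $1 plus the $2.50 bonus
    print(het_oa_reward("B", "B", 40, 60))  # starting $1 plus a bonus of about $1.67
    print(het_oa_reward("A", "B", 40, 60))  # starting $1 only
```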

RF Mechanism Description Since every essay is graded by two people, we will call those two 
students partners. Suppose that Lisa is Sam’s partner. First, for every other essay, one of the two 
graders that have evaluated it is chosen to form a collection of graders. If all the grades given by 
this collection do not have 2 As and 2 Bs, then Sam does not get any reward and the scheme ends. 

If they have 2 As and 2 Bs, then choose two graders from this collection who gave the same 
grade to their essays as Sam. Suppose these graders are Bob and Mike. Let Bob’s partner be Alice 
and let Mike’s partner be Nicole. 

Sam is then rewarded as follows: 

• He gets a starting reward of $0.50. 

• He gets a bonus of $1 if Lisa’s grade is same as Alice’s. 

• But then he pays a penalty of $0.5 if Alice’s grade is same as Nicole’s. 

For instance, if Sam gives an B, while Lisa, Nicole and Alice give an A, Sam would receive a total 
reward of 0.5 + 1 - 0.5 = $1. 
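
The three payment steps above can be sketched as follows. The sketch covers only the final payment as described to the workers and assumes that the collection of graders has already passed the "2 As and 2 Bs" check and that Bob and Mike have been chosen. Sam's own grade enters only through that choice, so it is not an argument here. The function name is ours.

```python
def rf_reward(lisa, alice, nicole, start=0.50, bonus=1.00, penalty=0.50):
    """Sam's reward in dollars under the worker-facing RF description.

    lisa is the grade given by Sam's partner; alice and nicole are the grades
    given by the partners of Bob and Mike (the two chosen graders who gave the
    same grade as Sam).
    """
    reward = start
    if lisa == alice:
        reward += bonus  # bonus when Lisa's grade matches Alice's
    if alice == nicole:
        reward -= penalty  # penalty when Alice's grade matches Nicole's
    return reward


if __name__ == "__main__":
    # The example from the text: Lisa, Alice and Nicole all give an A.
    print(rf_reward("A", "A", "A"))
```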


Belief Structure In addition to the description of the mechanism, workers were also presented 
with three sentences pertaining to the beliefs of the fictional character Sam: 

1. Sam believes that 20% of the essays will get an A and 80% will get a B if everyone grades 
honestly. 

2. Suppose that Sam thinks that his essay deserves an A, and he thinks that Lisa will also give 
it an A with a 40% chance. 

Assuming that every other grader is going to grade honestly, what grade should Sam report to 
maximize his reward? 
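
For the HET-OA description, the question can be answered with a short expected-value calculation. The sketch below assumes that the overall grade popularity matches Sam's stated belief (20% A, 80% B) and that his belief about Lisa is the stated 40%; the function name is ours, and the RF mechanism is not covered.

```python
def expected_het_oa_bonus(report, p_lisa_a, pct_a, pct_b):
    """Expected HET-OA bonus in dollars when Sam reports `report`.

    p_lisa_a is Sam's belief that Lisa gives an A; pct_a and pct_b are the
    overall percentages of A and B grades that Sam expects.
    """
    if report == "A":
        return p_lisa_a * 100.0 / pct_a
    return (1.0 - p_lisa_a) * 100.0 / pct_b


# Sam's stated beliefs: 20% A / 80% B overall, 40% chance that Lisa gives an A.
bonus_if_a = expected_het_oa_bonus("A", 0.4, 20, 80)
bonus_if_b = expected_het_oa_bonus("B", 0.4, 20, 80)

if __name__ == "__main__":
    print(f"report A: expected bonus ${bonus_if_a:.2f}")
    print(f"report B: expected bonus ${bonus_if_b:.2f}")
```

Under these assumptions, reporting A (the truthful grade) gives the larger expected bonus, 0.4 × 100/20 = $2.00 against 0.6 × 100/80 = $0.75. In the inverted version the same amounts are penalties, so reporting B becomes the better choice.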